Visualizations with ggplot2

Bryan Mayer

2022-10-26

Getting Started

Download the workshop materials to follow along:
- https://tinyurl.com/hvtn-ggplot2-workshop
Unzip the folder.
Load the project and open ggplot2-worksheet-hvtn2022.Rmd
- The presentation link is in the Rmd.
- Follow instructions in first chunk:
  - Make sure you are in the correct directory.
  - ggplot2 is the only required library for the exercises.
  - Load in data: mock_bama_example

Code and Attribution

Many, many people have made ggplot2 tutorials!
- https://pkg.garrickadenbuie.com/gentle-ggplot2
- https://ggplot.library.duke.edu/slides/
Tidyverse help page: https://ggplot2.tidyverse.org/reference
Hadley Wickham’s text: R for Data Science

The presentation and code are publicly available on GitHub: https://github.com/bryanmayer/hvtn-ggplot2-workshop

Goal today – Template HVTN figure

Roadmap

ggplot2 basics
Building ggplots with examples.
Construct HVTN BAMA magnitude plot.
If time, piece together full HVTN figure.

ggplot2 - the quintessential example

Basic template:

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

In action:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

ggplot2 vs. base R plot

For simple plots, base plot can be useful.
- Many R packages and objects use plot() on the backend, so knowledge helpful.
The grammar of ggplot2 eases extension into complex plots.
- Ex. mapping of colors and legend generation is automatic within ggplot2.

par(mfrow = c(1,2))

plot(mpg$displ, mpg$hwy)

plot(x = mpg[mpg$class == "compact", ]$displ, 
     y =mpg[mpg$class == "compact", ]$hwy, 
     col = "orange", xlim = c(1.5, 7),
     ylab = "hwy", xlab = "displ")   
    
points(x = mpg[mpg$class == "2seater", ]$displ, 
       y = mpg[mpg$class == "2seater", ]$hwy, 
       col = "red")
    
legend(x = 3, y = 40, c("compact", "2seater"), 
       col = c("orange", "red"), pch = 1)

ggplot2 Basic Grammar

# explicitly stating 'mapping = aes()' is uncommon. "data = " necessary for geom assignments.
plA = ggplot(data = <DATA>, aes(<MAPPINGS>)) + 
  <GEOM_FUNCTION>()

plB = ggplot(aes(<MAPPINGS>)) + 
  <GEOM_FUNCTION>(data = <DATA>)

plC = ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(aes(<MAPPINGS>))

plD = ggplot() + 
  <GEOM_FUNCTION>(data = <DATA>, aes(<MAPPINGS>))

ggplot() function initializes/creates a new plot.
+ is a ggplot2 operator (function) used to build a plot over function layers.
<GEOM_FUNCTION>() (ex. geom_point() scatterplot) draws a geometry on the plot. I.e., ‘plot type’(s).
data argument for setting a data set and aes() function sets aesthetic mappings by variable.
- Within ggplot() sets global mappings across all plots.
- Within geom_**() will supersede global assignment but not apply to other geoms (local).
ggplot2 plot objects can be saved as a variables.
- Layers can be added later to a ggplot2 object.
- Advanced: The internals of the ggplot2 object can be manipulated.
- Advanced: Multiple ggplot2 objects can be stored in a list (automation).

Common Geometries

Type	Function
Scatter/Point	`geom_point()`
Line	`geom_line()`
Box plot	`geom_boxplot()`
Bar plot	`geom_bar()`, `geom_col()`
Histogram	`geom_histogram()`
Density	`geom_density()`
Regression/Spline	`geom_smooth()`
Text	`geom_text()`
Vert./Horiz. Line	`geom_{vh}line()`
Jittered Point	`geom_jitter()`
Count	`geom_count()`

https://eric.netlify.com/2017/08/10/most-popular-ggplot2-geoms/

Layering (combining) Geometries

Plots with multiple geometries can be quickly made by layering.

base_boxplot = ggplot(data = mpg, aes(x = class, y = hwy)) 
base_point = ggplot(data = mpg, aes(x = displ, y = hwy))

Aesthetic Mappings

Aesthetic mappings within aes() link data to an aesthetic.
Mappings depend on plot geometries.
- Check individual help files for information on them (ex., ?geom_text).
- Common source of error, but error message usually informative.
> base_boxplot + geom_histogram()

Error in f(): ! stat_bin() can only have an x or y aesthetic. Run rlang::last_error() to see where the error occurred.

Common Aesthetics

Aesthetic	`aes()` Argument
x-axis	`x`
y-axis	`y`
Color outline	`color`
Color fill	`fill`
Point shape/type	`shape`
Point size	`size`
Opacity	`alpha`
Line type	`linetype`
Text	`label`

Aesthetic Settings vs. Mappings

Setting: an aesthetic independent of data, assigned outside of the aes() argument.
- This must be assigned within the appropriate geometry.
- Assigning a setting within aes() is a data mapping, ggplot2 will assume this a variable with a single value.

ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point(aes(color = "blue"))

ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point(color = "blue")

ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) + geom_point()

Continuous vs. Discrete Data

Aesthetic mappings depend on type of data.
ggplot2 will assume numeric classes are continuous.
- If the aesthetics requires a certain data, an informative error will display.
  - Depending on geometry, aesthetics may only take one type of data (ex. linetype must be categorical class).
- If the aesthetic is flexible, the plot may appear odd (potentially with warning).
You can wrap variables with functions within aes() calls.
- Ex. aes(y = log10(magnitude)), aes(y = pmax(1, magnitude)).
- Trick: Wrap numeric variables in factor() if need discrete mapping.

ggplot(data = mpg, aes(x = cyl, y = hwy)) + geom_boxplot()
ggplot(data = mpg, aes(x = factor(cyl), y = hwy)) + geom_boxplot()

Boxplot example (mpg) - setup

Question: Did highway mpg (hwy) improve between 1999 and 2008 (year)?

Data fields: hwy year (which is x?, which is y?)
Geometry? geom_boxplot
Mappings: let’s map year to color.

Take 5 minutes to make this plot. Work with your neighbor.

05:00

Box plot example (mpg) - v1

mpg_base = ggplot(mpg, aes(x = factor(year), y = hwy, colour = factor(year))) +
  geom_boxplot()

mpg_base

Why Factor?

Scale functions

ggplot(data = <DATA>, aes(<MAPPINGS>)) + 
  <GEOM_FUNCTION>()> +
  <SCALE_FUNCTION>()

Scale functions: ‘transformation’ of the aesthetic mapping.
- Specific mapping for each level of the data (ex. level 1 <=> “green”).
Scale function inputs:
- First argument (name) is always the element (legend or axis) title. Default to variable name.
- General arguments: limits, breaks, labels
  - Customize legend and axes ticks and labels.
- A way to apply transformations to the data as they are plotted.
Generating of high quality plots often requires manipulation of the scale functions.
Types: scale_*_discrete(), scale_*_manual(), scale_*_continuous()

Discrete and Manual Scale Functions

Discrete scales (scale_*_discrete()): tinkering with palettes, labels, and breaks.
- Adjusting discrete x- or y-axis (scale_x_discrete, scale_y_discrete) is not always intuitive (e.g., use limits instead of breaks).
Manual scales (scale_*_manual): explicit mapping of data level to a setting.
- Make sure breaks and labels match as expected.
- scale_color_manual(breaks = c("A", "B"), labels = c("A", "B"), values = c("red", "blue"))

ggplot(data = mpg, 
       aes(x = displ, y = hwy, colour = factor(cyl))) +
  geom_point() +
  scale_color_manual("cylinder #", breaks = c(4, 5, 6, 8),
                     labels = paste(c(4, 5, 6, 8), "cylinder"),
                     values = c("red", "green", "purple", "black"))

Manual labeling

Use named vectors for consistent repeated scale mappings.
- Ex. Same vaccine group color and label assignments across many plots.
- Note: Use ` ` for non-standard R column names.
Not shown here: reference data frame (or list) linking breaks, labels, colors
- breaks = dataframe$breaks, labels = dataframe$labels, values = dataframe$colors

cyl_colors = c(`4` = "red", `5` = "green", 
               `6` = "purple", `8` = "black")

cyl_labels =  paste(c(4, 5, 6, 8), "cylinder")
names(cyl_labels) = c(4, 5, 6, 8)

ggplot(data = mpg, 
       aes(x = displ, y = hwy, 
           colour = factor(cyl))) +
  geom_point() +
  scale_color_manual("cylinder #", 
                     values = cyl_colors,
                     labels = cyl_labels)

Continuous Scales

Continuous scale manipulation tends to apply to continuous data mapped to axes.
- Common exceptions: heat maps, point size mapping.
The default x- and y-axis scales are scale_x_continuous and scale_y_continuous.
- Most common alternative, log-transform: scale_x_log10 and scale_y_log10.
- Plotting on the log-scale vs. log-transformed data (units are different).
- Other transformations are possible.
More info: https://ggplot2.tidyverse.org/reference/index.html#section-scales

ggplot(data = mpg, 
       aes(x = displ, y = hwy)) +
  geom_point() +
  scale_y_log10()

ggplot(data = mpg, 
       aes(x = displ, y = log10(hwy))) +
  geom_point()

Coordinate functions

Coordinate functions adjust the coordinate system of the plot.
- Not direct transformation to the data: different than scale mappings. Important when summary measures (e.g., median) are computed by ggplot2.
- Always read the warnings.
Common coordinate functions:
- coord_cartesian(): default. Limit arguments can be used to zoom plot (Example).
- coord_flip(): flips x and y on the plots.
- coord_polar(): use polar coordinate system.
- coord_fixed(): fixes aspect ratio, make ‘square’ plots.

coord_ex = data.frame(
  grp = rep(LETTERS[1:2], each = 10), 
  y = rnorm(20))

ggplot(coord_ex, aes(x = grp, y = y)) + 
  geom_boxplot() + 
  scale_y_continuous(limits = c(0, NA))

ggplot(coord_ex, aes(x = grp, y = y)) + 
  geom_boxplot() + 
  coord_cartesian(ylim = c(0, NA))

Making a nice looking plot.

Adjusted title and labels for axes and legends.
- scale_y_continuous("y title", breaks = 5*3:9).
- Set breaks and labels within the appropriate scale function.
Another way to change legend and axes titles:
- labs(x = "x title", y = "y title" color = "legend title")
- Alternative axes title: xlab(), ylab()
theme():
- There are lots of plot settings to adjust (see ?theme).
- General themes can be made and applied across multiple plots.
  - ggplot2 also supplies themes: theme_bw() theme_classic()

Box plot example (mpg) - Improve

Take 5 minutes. Work with your neighbor.

Make the labels nicer.
- label x-axis as ‘year’, y-axis as ‘highway mpg’
Map year to color.
- make 1998 = red, 2008 = black
- then 1998 = black, 2008 = red.
- Can you generalize this task?
Show the raw data as points. (Bonus: jitter them.)
Apply + theme_classic()

05:00

Box plot example (mpg) - finalize

mpg_colors = c(`1999` = "black", `2008` = "red")

mpg_final = mpg_base +
  geom_point(position = position_jitter(width = 0.2)) +
  scale_color_manual(values = mpg_colors) +
  labs(x = "year", y = "highway mpg", colour = "year") +
  theme_classic()

mpg_final

Faceting

facet_grid or facet_wrap create multiple panels across a variable level.
facet_grid: structured, explicit mapping of rows and col (rectangular matrix).
- 2-var ex. + facet_grid(rows = vars(var1), cols = vars(var2)) or + facet_grid(var1 ~ var2)
- ex. + facet_grid(rows = vars(var1)) or + facet_grid(var1 ~ .)
- ex. + facet_grid(cols = vars(var1)) or + facet_grid(. ~ var1)
facet_wrap: sequence of panels across variable or layered variables.
- ex. + facet_wrap(vars(var1)) or + facet_wrap(~ var1)

Last mpg exercise: use facteting to investigate whether 1998 vs. 2008 highway mpg varies by car class.

Box plot example (mpg) - faceting

mpg_final +
  facet_wrap(~ class)

HVTN Example

Figure can be built in 3 parts. Focus just on magnitude plot.

HVTN - Data and Setup

dplyr::glimpse(mock_bama_example)

Rows: 60
Columns: 12
$ labid         <chr> "Frankenstein", "Frankenstein", "Frankenstein", "Franken…
$ pub_id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ rx_code       <chr> "P", "T1", "T2", "T3", "P", "T1", "T2", "T3", "P", "T1",…
$ visitno       <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
$ isotype       <chr> "IgG", "IgG", "IgG", "IgG", "IgG", "IgG", "IgG", "IgG", …
$ antigen       <chr> "ag_b", "ag_b", "ag_b", "ag_b", "ag_b", "ag_b", "ag_b", …
$ gene          <chr> "gp120", "gp120", "gp120", "gp120", "gp120", "gp120", "g…
$ clade         <chr> "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "…
$ fi_bkgd       <dbl> 11985.2477, 6188.5244, 6611.3519, 12819.1074, 5344.8246,…
$ fi_bkgd_blank <dbl> 6164.825, 6194.799, 4535.088, 4173.354, 5556.441, 6911.0…
$ delta         <dbl> 5820.4230, 1.0000, 2076.2640, 8645.7529, 1.0000, 471.266…
$ response      <dbl> 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,…

We have fixed colors and shapes, let’s assign those now:

hvtn_pl_colors = c(T1 =  "#1749FF", T2 = "#D92321", T3 = "#0AB7C9", P = "#787873", Negative = "#8F8F8F")
hvtn_pl_shapes = c(T1 =  16, T2 = 16, T3 = 16, P = 16, Negative = 2)
axis_grp_order = c("T1", "T2", "T3", "P")

Magnitude plot - Mapping (discussion)

Rows: 60
Columns: 12
$ labid         <chr> "Frankenstein", "Frankenstein", "Frankenstein", "Franken…
$ pub_id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ rx_code       <chr> "P", "T1", "T2", "T3", "P", "T1", "T2", "T3", "P", "T1",…
$ visitno       <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
$ isotype       <chr> "IgG", "IgG", "IgG", "IgG", "IgG", "IgG", "IgG", "IgG", …
$ antigen       <chr> "ag_b", "ag_b", "ag_b", "ag_b", "ag_b", "ag_b", "ag_b", …
$ gene          <chr> "gp120", "gp120", "gp120", "gp120", "gp120", "gp120", "g…
$ clade         <chr> "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "…
$ fi_bkgd       <dbl> 11985.2477, 6188.5244, 6611.3519, 12819.1074, 5344.8246,…
$ fi_bkgd_blank <dbl> 6164.825, 6194.799, 4535.088, 4173.354, 5556.441, 6911.0…
$ delta         <dbl> 5820.4230, 1.0000, 2076.2640, 8645.7529, 1.0000, 471.266…
$ response      <dbl> 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,…

Geometry?
Aesthetic Data Mappings
- Which data fields are used in the plot?
- How are they mapped?
- Any transformations?
Facets?
Challenges?

Magnitude plot - Exercise

Geometry: geom_boxplot, geom_point (jitter)
Data fields: magnitude (delta), groups (rx_code), response (response)
aes mappings: group to color, response to shape
Transformations: magnitude is truncated and plotted on the log10-scale.
Facets: visits (visitno)
Challenges:
- Separate color for ‘Negative’ within each group (and legend).
- Exclude non-responders from box plots.
Exercise: make magnitude box plot from worksheet skeleton.
- “Exercise 3 - HVTN Magnitude plot” ~ line 221
- Hints provided on the worksheet in the plot skeleton.
- Ignore color issue for Negative.

Magnitude plot - Skeleton (Line 221)

ggplot(mock_bama_example, aes(...)) +
  geom_boxplot(..., show.legend = F, outlier.colour = NA) +
  geom_point(..., position = position_jitter(width = 0.2)) +
  scale_shape_manual(values = c(2, 16)) +
  scale_color_manual(values = hvtn_pl_colors) +
  scale_x_discrete(limits = ...) +
  scale_y_...() +
  facet_wrap(...) +  
  labs(...) +
  theme_classic()

Geometry: geom_boxplot, geom_point (jitter)
Data fields: magnitude (delta), groups (rx_code), response (response)
aes mappings: group to color, response to shape
Transformations: magnitude is truncated and plotted on the log10-scale.
Facets: visits (visitno)
Ignore negative responder color mapping.

05:00

Response magnitude plot - v1

Response magnitude box plots - tinkering

Rethink the mapping: rx_code to x-axis, but need new map to color.

mock_bama_example$grp_response = ifelse(mock_bama_example$response == 0, 
                                         "Negative", 
                                         mock_bama_example$rx_code)

Update the plot using this variable for color mapping.

03:00

Magnitude plot - v2 Code

set.seed(14) # this keeps the jitter consistent each time

mag_bp_v2 = ggplot(data = mock_bama_example, 
       aes(x = rx_code, y = pmax(100, delta), color = grp_response)) +
  geom_boxplot(data = subset(mock_bama_example, response == 1), 
               show.legend = F, outlier.colour = NA) +
  geom_point(aes(shape = grp_response, group = rx_code), 
             position = position_jitter(width = 0.2)) +
  scale_colour_manual("", values = hvtn_pl_colors) +
  scale_shape_manual("", values = hvtn_pl_shapes) +
  scale_x_discrete(limits = axis_grp_order) +
  scale_y_log10(breaks = c(100, 1000, 2000, 3000, 5000, 10000),
                label = c("< 100", 1000, 2000, 3000, 5000, 10000)) +
  facet_wrap(~visitno) +
  labs(y = "IgG binding antibody units (MFI - blank)", x = "") +
  theme_classic() +
  theme(panel.background = element_rect(color="black"), legend.position = "bottom")

Response magnitude box plots - version 2

To do before final figure: tinkering with theme().
- Remove panel (strip) labels for stacking. strip.text = element_blank()
- Wrap individual plots with boxes. panel.background = element_rect(color="black").

Finalized magnitude figure

Finishing HVTN Example

Walkthrough parts 2 and 3. Code is in the worksheet.

Responder total tally - step 1

Approach: Summarize the data and get the text aligned on a plot.

response_rate_summary = mock_bama_example %>%
  group_by(rx_code, antigen, visitno) %>%
  summarize(
    Total = n(),
    Responders = sum(response),
    rr = mean(response),
    .groups = "drop"
  )

rr_txt_pl_skeleton = ggplot(response_rate_summary, 
                            aes(x = rx_code, y = 1, 
                                label = paste(Responders, Total, sep = "\n"))) +
  geom_text(size = 3) +
  scale_x_discrete(limits = axis_grp_order) +
  scale_y_continuous(breaks = 1, labels = "Responders\nTotal", expand = c(0, 0)) +
  facet_wrap(~visitno) +
  labs(y = "") 

rr_txt_pl_skeleton

Responder total tally - step 1

Responder total tally - finalize

Use theme() to remove most plot elements. The axes coordinates are invisible but remain aligned with group.

rr_txt_pl = rr_txt_pl_skeleton +
  theme(panel.background = element_blank(), axis.text.x = element_blank(),
        axis.title.x = element_blank(), axis.ticks = element_blank(),
        strip.background = element_blank(), strip.text = element_blank(), 
        axis.text.y = element_text(size = 10), plot.margin = margin())


rr_txt_pl

Responder total tally - finalize

Response rate barplot

Two potential approaches (both in worksheet):
- Use stat_summary_bin applied to the full dataset.
- Use summarized data from tallies and geom_bar(stat = 'identity') (see worksheet).

resp_rate1 = ggplot(mock_bama_example, aes(x = rx_code, y = 100*response, fill = rx_code)) +
  stat_summary_bin(fun = "mean", geom = "bar") +
  scale_fill_manual(values = hvtn_pl_colors, guide = "none") +
  scale_x_discrete(limits = axis_grp_order) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 102)) +
  labs(y = "Response rate (%)", x = "") +
  facet_wrap(~visitno, labeller = 
               as_labeller(c(`5` = "Visit 5",
                             `7` = "Visit 7",
                             `11` = "Visit 11"))) +
  theme_classic() +
  theme(strip.background = element_blank(), panel.background = element_rect(color="black"),
        plot.title = element_text(size = 12)) + 
  ggtitle("HVTN Trial IgG Antibody Responses to ag_b")

resp_rate1

Response rate barplot

Final Figure

Use cowplot::plot_grid

plot_grid(
  resp_rate1 + theme(axis.text.x = element_blank(),  
                     axis.ticks.x = element_blank()), 
  rr_txt_pl + theme(plot.margin = margin(t = -20)), 
  mag_final, 
  rel_heights = c(18, 1.75, 30), 
  nrow = 3, align = "v", axis = "lr")

Final Figure

Extra slides

V1 Plot Answer

ggplot(data = mock_bama_example, aes(x = rx_code, y = pmax(100, delta), 
                                                 color = rx_code)) +
  geom_boxplot(data = subset(mock_bama_example, response == 1), 
               show.legend = F, 
               outlier.colour = NA) +
  geom_point(aes(shape = factor(response), group = rx_code), 
             position = position_jitter(width = 0.2)) +
  scale_colour_manual("", values = hvtn_pl_colors) +
  scale_shape_manual("", values = c(2, 16)) +
  scale_x_discrete(limits = axis_grp_order) +
  scale_y_log10(breaks = c(100, 1000, 2000, 3000, 5000, 10000)) +
  facet_wrap(~ visitno) +
  labs(y = "IgG binding antibody units (MFI - blank)", x = "") +
  theme_classic()

Visualizations with ggplot2

Getting Started

Code and Attribution

Many, many people have made ggplot2 tutorials!

Goal today – Template HVTN figure

Roadmap

ggplot2 - the quintessential example

ggplot2 vs. base R plot

ggplot2 Basic Grammar

Common Geometries

Layering (combining) Geometries

Aesthetic Mappings

Common Aesthetics

Aesthetic Settings vs. Mappings

Continuous vs. Discrete Data

Boxplot example (mpg) - setup

Box plot example (mpg) - v1

Scale functions

Discrete and Manual Scale Functions

Manual labeling

Continuous Scales

Coordinate functions

Making a nice looking plot.

Box plot example (mpg) - Improve

Box plot example (mpg) - finalize

Faceting

Box plot example (mpg) - faceting

HVTN Example

HVTN - Data and Setup

Magnitude plot - Mapping (discussion)

Magnitude plot - Exercise

Magnitude plot - Skeleton (Line 221)

Response magnitude plot - v1

Response magnitude box plots - tinkering

Magnitude plot - v2 Code

Response magnitude box plots - version 2

Finalized magnitude figure

Finishing HVTN Example

Responder total tally - step 1

Responder total tally - step 1

Responder total tally - finalize

Responder total tally - finalize

Response rate barplot

Response rate barplot

Final Figure

Final Figure

Extra slides

V1 Plot Answer