Coding workshop: Week 2

using Quarto, data wrangling, visualizing uncertainty

tidyverse
janitor
read_csv
pipe operators
|>
clean_names
mutate
select
pivot_longer
group_by
summarize
sd
qt
ggplot
geom_histogram
geom_point
geom_errorbar
geom_pointrange
geom_line
Author
Affiliation
Published

April 10, 2026

Modified

April 10, 2026

Workshop date: Friday 10 April

1. Summary

Packages

  • tidyverse
  • janitor

Operations

  • read in data using read_csv()
  • chain functions together using |>
  • clean column names using clean_names()
  • create new columns using mutate()
  • select columns using select()
  • make data frame longer using pivot_longer()
  • group data using group_by()
  • undo grouping using ungroup()
  • summarize data using summarize()
  • calculate standard deviation using sd()
  • calculate t-values using qt()
  • visualize data using ggplot()
  • create histograms using geom_histogram()
  • visualize means and raw data using geom_point()
  • visualize standard deviation, standard error, and confidence intervals using geom_errorbar() and geom_pointrange()
  • visualize trends through time using geom_point() and geom_line()

Data source

This week, we’ll work with data on seafood production types (aquaculture or capture). This workshop’s data comes from Tidy Tuesday 2021-10-12, which was from OurWorldinData.org.

2. Code

1. Set up

This section of code includes reading in the packages you’ll need: tidyverse and janitor.

You’ll also read in the data using read_csv() and store the data in an object called production.

# load in packages
library(tidyverse)
library(janitor)
# read in data
production <- read_csv("captured_vs_farmed.csv")
Note

Remember to look at the data before working with it! You can use View(production) in the console, or click on the production object in the Environment tab in the top right.

Before you start: think about what the differences might be between aquaculture and capture production in the context of fisheries. Which production type do you think would produce more seafood?

insert your best guess here

2. Cleaning up

The data comes in what’s called “wide format”, meaning that each row represents multiple observations. For example, the first row contains the production from Afghanistan (country code AFG) in 1969 for aquaculture and capture production.

We want to convert the data into “long format” so that it’s easier to work with. A dataset is in long format if each row represents an observation.

In this chunk of code, we’ll:

  1. clean the column names using clean_names()
  2. filter to only include the “entity” we want using filter()
  3. select the columns of interest using select()
  4. make the data frame longer using pivot_longer()
  5. manipulate the type column to change the long names (e.g. aquaculture_production_metric_tons) to short names (e.g. aquaculture) using mutate() and case_when()
  6. use mutate() to create a new column called metric_tons_mil
# creating new object from production 
production_clean <- production |> 
  # clean column names
  clean_names() |> 
  # filter to only include observations from the US
  filter(entity == "United States") |>
  # select columns of interest 
  select(year, 
         aquaculture_production_metric_tons, 
         capture_fisheries_production_metric_tons) |>
  # choose columns to pivot
  pivot_longer(cols = aquaculture_production_metric_tons:
                 capture_fisheries_production_metric_tons, 
               # column with fishery type should be named "type"
               names_to = "type", 
               # column with catch amount should be named "catch_metric_tons"
               values_to = "catch_metric_tons") |> 
  # mutate the existing type column
  mutate(type = case_when( 
    # fill in aquaculture to replace aquaculture_production_metric_tons
    type == "aquaculture_production_metric_tons" ~ "aquaculture", 
    # fill in capture to replace capture_fisheries_production_metric_tons
    type == "capture_fisheries_production_metric_tons" ~ "capture" 
  )) |> 
  # convert catch in metric tons to millions
  mutate(metric_tons_mil = catch_metric_tons/1000000) 

3. Making a boxplot/jitter plot

Last week in workshop, we made a boxplot. The boxplot shows useful summary statistics (displays the central tendency and spread), while the jitter plot shows the actual observations (in this case, each point is the catch for a fishery in a given year). One way to display the underlying data is to combine a boxplot with a jitter plot.

# start with the production_clean data frame
ggplot(data = production_clean, 
       # x-axis should be type of production
       # y-axis should be metric tons of production (in millions)
       # coloring by production type
       mapping = aes(x = type, 
                     y = metric_tons_mil, 
                     color = type)) + 
  # first layer: boxplot
  geom_boxplot() + 
  # second layer: jitter
  geom_jitter(width = 0.2, # making the points jitter horizontally
              height = 0) + # making sure points don't jitter vertically
  # relabelling x- and y-axes
  labs(x = "Type", 
       y = "Metric tons of production (in millions)") 

On average, a) which type of production produces more fish, and b) what components of the plot are you using to come up with your answer?

a) Capture production, because b) the median of the boxplot is way higher than the median for aquaculture

4. Making a histogram

This chunk of code creates a histogram. Note that for a histogram, you only need to fill in the aes() argument for the x-axis (x), not the y-axis. This is because ggplot() counts the number of observations in each bin for you.

Within the geom_histogram() call, you’ll need to tell R what number of bins you want using the bins argument. In this case (using the Rice Rule to determine the appropriate number of bins), we’ll use 10 bins.

# base layer: ggplot
ggplot(data = production_clean,
       # x-axis should be metric_tons_mil
       # fill by production type (aquaculture or capture)
       mapping = aes(x = metric_tons_mil,
                     fill = type)) + 
  # first layer: histogram
  geom_histogram(bins = 10, # set the number of bins based on Rice Rule
                 color = "black") # make the border of the columns black

Can you tell from looking at the histogram which production type tends to produce more fish? Why or why not?

Yes. There are no observations for aquaculture at high catch (in millions); most of the observations for aquaculture are lower than 1 million tons, while capture production ranges up to 6 million tons.

5. Visualizing spread, uncertainty, and confidence

a. Calculations

# create a new object from clean data frame
production_summary <- production_clean |> 
  # group by production type
  group_by(type) |> 
  # calculate values for each production type
  summarize(
    mean = mean(metric_tons_mil), # mean
    n = length(metric_tons_mil), # number of observations (n)
    df = n - 1, # degrees of freedom
    sd = sd(metric_tons_mil), # standard deviation
    se = sd/sqrt(n), # standard error
    tval = qt(p = 0.05/2, df = df, lower.tail = FALSE), # t value
    margin = tval*se, # margin of error
    ci_lower = mean - margin, # lower bound of the confidence interval
    ci_higher = mean + margin # upper bound of the confidence interval
  ) 

# display object
production_summary
# A tibble: 2 × 10
  type         mean     n    df    sd     se  tval margin ci_lower ci_higher
  <chr>       <dbl> <int> <dbl> <dbl>  <dbl> <dbl>  <dbl>    <dbl>     <dbl>
1 aquaculture 0.324    59    58 0.149 0.0194  2.00 0.0389    0.285     0.363
2 capture     4.38     59    58 1.20  0.156   2.00 0.313     4.07      4.69 

How could you write about these confidence intervals?

Mean production in US aquaculture is 0.32 million metric tons [95% CI: 0.29, 0.36], while mean capture production is 4.38 million metric tons [95% CI: 4.07, 4.69].

b. Visualizations

When visualizing the central tendency (in this case, mean) and spread (standard deviation) or uncertainty (standard error) or confidence (confidence intervals), you can stack geoms on top of each other.

To visualize the mean, we’ll use geom_point(). Remember that geom_point() can be used for any plot you want to make that involves a point.

To visualize the spread/uncertainty/confidence, we’ll use geom_errorbar(). This is the geom that creates two lines that can extend away from a point.

Standard deviation

First, we’ll visualize standard deviation.

# base layer: ggplot
ggplot(data = production_summary, 
       # x-axis should be production type
       # y-axis is mean production
       # color by production type
       mapping = aes(x = type, 
                     y = mean, 
                     color = type)) + 
  # first layer: points to represent mean
  geom_point(size = 2) + 
  # second layer: bars to represent standard deviation
  geom_errorbar(mapping = aes(ymin = mean - sd, 
                              ymax = mean + sd),
                width = 0.1) + # make the bars narrower
  # relabelling x- and y-axes, adding title
  labs(title = "Standard deviation",
       x = "Type",
       y = "Mean and SD million metric tons production")

Standard error

Then, we want to visualize standard error.

# base layer: ggplot
ggplot(data = production_summary, 
       # x-axis should be production type
       # y-axis is mean production
       # color by production type
       mapping = aes(x = type, 
                     y = mean, 
                     color = type)) + 
  # first layer: points to represent mean
  geom_point(size = 2) + 
  # second layer: bars to represent standard error
  geom_errorbar(mapping = aes(ymin = mean - se, 
                              ymax = mean + se),
                width = 0.1) + # make the bars narrower
  # relabelling x- and y-axes, adding title
  labs(title = "Standard error",
       x = "Type",
       y = "Mean and SE million metric tons production")

95% confidence interval

Then, we want to visualize the 95% confidence interval.

# base layer: ggplot
ggplot(data = production_summary, 
       # x-axis should be production type
       # y-axis is mean production
       # color by production type
       mapping = aes(x = type, 
                     y = mean, 
                     color = type)) + 
  # first layer: points to represent mean
  geom_point(size = 2) + 
  # second layer: bars to represent 95% CI
  geom_errorbar(mapping = aes(ymin = ci_lower, 
                              ymax = ci_higher),
                # make the bars narrower
                width = 0.1) + 
  # relabelling x- and y-axes, adding title
  labs(title = "Confidence interval using geom_point and geom_errorbar",
       x = "Type",
       y = "Mean and 95% CI million metric tons production")

c. geom_pointrange()

We can also visualize means and spread/uncertainty/confidence intervals using the geom_pointrange() function.

# base layer: ggplot
ggplot(data = production_summary, 
       # x-axis should be production type
       # y-axis is mean production
       # color by production type
       mapping = aes(x = type, 
                     y = mean, 
                     color = type)) +
  # first layer: points representing mean,
  geom_pointrange(mapping = aes(ymin = ci_lower, 
                                ymax = ci_higher)) +
  # relabelling x- and y-axes, adding title
  labs(title = "Confidence interval (using geom_pointrange)",
       x = "Type",
       y = "Mean and 95% CI million metric tons production")

d. Visualizing with the underlying data

Lastly, we want to visualize the 95% confidence interval with the underlying data.

# base layer: ggplot
ggplot(data = production_clean,
       mapping = aes(x = type, 
                     y = metric_tons_mil, 
                     color = type)) +
  # first layer: adding data (each point shows an observation)
  geom_jitter(width = 0.1,
              height = 0,
              alpha = 0.4,
              shape = 21) +
  # second layer: means and 95% CI
  geom_pointrange(data = production_summary,
                  mapping = aes(x = type, 
                                y = mean, 
                                ymin = ci_lower, 
                                ymax = ci_higher)) +
  # custom colors
  scale_color_manual(values = c("aquaculture" = "deepskyblue",
                                "capture" = "orangered")) +
  # relabelling x- and y-axes, adding title
  labs(x = "Production type",
       y = "Million metric tons of production",
       title = "Capture produces more than aquaculture") +
  # custom theme
  theme_light() +
  # taking out legend
  theme(legend.position = "none")

There are lots of themes in ggplot to play around with. These are nice to use to get rid of the grey background that is the default, and generally make your plot look cleaner.

A list of built-in themes and their theme_() function calls is here.

e. Visualizing through time

Then, if we want to visualize production through time:

# base layer: ggplot
ggplot(data = production_clean,
       # x-axis should be year, y-axis should be metric tons in millions
       # color and shape by production type
       mapping = aes(x = year,
                     y = metric_tons_mil,
                     color = type,
                     shape = type)) +
  # first layer: points
  geom_point() +
  # second layer: lines
  geom_line() +
  # manually set colors
  scale_color_manual(values = c("aquaculture" = "deepskyblue",
                                "capture" = "orangered")) +
  # relabel x- and y-axes, legend title
  labs(x = "Year",
       y = "Million metric tons of production",
       color = "Production type",
       shape = "Production type") +
  # custom theme
  theme_minimal()

END OF WORKSHOP 2

3. Extra stuff

Shapes in figures

In class, we used shape = 21 in the geom_point() call to make the points show up as open circles. By default, ggplot() uses shape = 16 for all geometries that include points.

This figure below shows the 26 options for shapes you can use in any plot with a point geometry.

Shapes 0-14 are only outlines (with a transparent fill). Shapes 15 - 20 are filled (no outline). This means you control them with color in the aes() function, and scale_color_() functions. These show up in pink in the plot below.

Shapes 21 - 25 include outlines and fills. You can manipulate both: you can change the outline using color and scale_color_() functions, and change the fill with fill and scale_fill_() functions. In the plot, outlines show up in pink and fills show up in yellow.

Credit to Albert Kuo’s blog post for inspiring me to make my own reference figure, and Alex Phillips’s colorblind friendly schemes for the colors in the figure.

Overlapping histograms

In class, we created a histogram that combined the distributions of metric tons of production for two groups: capture and aquaculture production.

I got a question about what a histogram would look like if there was overlap in the distributions of groups, and wanted to demonstrate visually what that would look like.

Here, I’m generating a data frame with fake values from distributions that have different means but overlapping spread:

set.seed(1)

# creating a data frame
test_df <- tibble(
  # generating fake values
  a = rnorm(n = 50, mean = 30, sd = 10),
  b = rnorm(n = 50, mean = 40, sd = 10)
) |> 
  # making data frame longer
  pivot_longer(
    cols = a:b,
    names_to = "group",
    values_to = "value"
  )

Then, we can compare the histograms if they are stacked (on the left) and separate (on the right):

Stacked histograms:

# base layer: ggplot
ggplot(data = test_df,
       # only supplying x-axis value
       # filling by group
       mapping = aes(x = value,
                     fill = group)) +
  # first layer: histogram
  geom_histogram(bins = 9,
                 color = "black") +
  # creating different panels for each group
  facet_wrap(~ group) +
  # edits to y-axis to get rid of space between bottom of columns and x-axis
  # changing y-axis limits
  # making y-axis breaks smaller to make it easier to see frequencies
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 36),
                     breaks = seq(from = 0, to = 36, by = 3)) 

Separate histograms:

# base layer: ggplot
ggplot(data = test_df,
       # only supplying x-axis value
       # filling by group
       mapping = aes(x = value,
                     fill = group)) +
  # first layer: histogram
  geom_histogram(bins = 9,
                 color = "black") +
  # creating different panels for each group
  # facet_wrap(~ group) +
  # edits to y-axis to get rid of space between bottom of columns and x-axis
  # changing y-axis limits
  # making y-axis breaks smaller to make it easier to see frequencies
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 36),
                     breaks = seq(from = 0, to = 36, by = 3)) 

If the distributions overlap, then the counts stack on top of each other.