Spring 2026 - Coding workshop: Week 2

Workshop date: Friday 10 April

1. Summary

Packages

tidyverse
janitor

Operations

read in data using read_csv()
chain functions together using |>
clean column names using clean_names()
create new columns using mutate()
select columns using select()
make data frame longer using pivot_longer()
group data using group_by()
undo grouping using ungroup()
summarize data using summarize()
calculate standard deviation using sd()
calculate t-values using qt()
visualize data using ggplot()
create histograms using geom_histogram()
visualize means and raw data using geom_point()
visualize standard deviation, standard error, and confidence intervals using geom_errorbar() and geom_pointrange()
visualize trends through time using geom_point() and geom_line()

Data source

This week, we’ll work with data on seafood production types (aquaculture or capture). This workshop’s data comes from Tidy Tuesday 2021-10-12, which was from OurWorldinData.org.

2. Code

1. Set up

This section of code includes reading in the packages you’ll need: tidyverse and janitor.

You’ll also read in the data using read_csv() and store the data in an object called production.

# load in packages
library(tidyverse)
library(janitor)

# read in data
production <- read_csv("captured_vs_farmed.csv")

Note

Remember to look at the data before working with it! You can use View(production) in the console, or click on the production object in the Environment tab in the top right.

Before you start: think about what the differences might be between aquaculture and capture production in the context of fisheries. Which production type do you think would produce more seafood?

insert your best guess here

2. Cleaning up

The data comes in what’s called “wide format”, meaning that each row represents multiple observations. For example, the first row contains the production from Afghanistan (country code AFG) in 1969 for aquaculture and capture production.

We want to convert the data into “long format” so that it’s easier to work with. A dataset is in long format if each row represents an observation.

In this chunk of code, we’ll:

clean the column names using clean_names()
filter to only include the “entity” we want using filter()
select the columns of interest using select()
make the data frame longer using pivot_longer()
manipulate the type column to change the long names (e.g. aquaculture_production_metric_tons) to short names (e.g. aquaculture) using mutate() and case_when()
use mutate() to create a new column called metric_tons_mil

# creating new object from production 
production_clean <- production |> 
  # clean column names
  clean_names() |> 
  # filter to only include observations from the US
  filter(entity == "United States") |>
  # select columns of interest 
  select(year, 
         aquaculture_production_metric_tons, 
         capture_fisheries_production_metric_tons) |>
  # choose columns to pivot
  pivot_longer(cols = aquaculture_production_metric_tons:
                 capture_fisheries_production_metric_tons, 
               # column with fishery type should be named "type"
               names_to = "type", 
               # column with catch amount should be named "catch_metric_tons"
               values_to = "catch_metric_tons") |> 
  # mutate the existing type column
  mutate(type = case_when( 
    # fill in aquaculture to replace aquaculture_production_metric_tons
    type == "aquaculture_production_metric_tons" ~ "aquaculture", 
    # fill in capture to replace capture_fisheries_production_metric_tons
    type == "capture_fisheries_production_metric_tons" ~ "capture" 
  )) |> 
  # convert catch in metric tons to millions
  mutate(metric_tons_mil = catch_metric_tons/1000000)

3. Making a boxplot/jitter plot

Last week in workshop, we made a boxplot. The boxplot shows useful summary statistics (displays the central tendency and spread), while the jitter plot shows the actual observations (in this case, each point is the catch for a fishery in a given year). One way to display the underlying data is to combine a boxplot with a jitter plot.

# start with the production_clean data frame
ggplot(data = production_clean, 
       # x-axis should be type of production
       # y-axis should be metric tons of production (in millions)
       # coloring by production type
       mapping = aes(x = type, 
                     y = metric_tons_mil, 
                     color = type)) + 
  # first layer: boxplot
  geom_boxplot() + 
  # second layer: jitter
  geom_jitter(width = 0.2, # making the points jitter horizontally
              height = 0) + # making sure points don't jitter vertically
  # relabelling x- and y-axes
  labs(x = "Type", 
       y = "Metric tons of production (in millions)")

On average, a) which type of production produces more fish, and b) what components of the plot are you using to come up with your answer?

a) Capture production, because b) the median of the boxplot is way higher than the median for aquaculture

4. Making a histogram

This chunk of code creates a histogram. Note that for a histogram, you only need to fill in the aes() argument for the x-axis (x), not the y-axis. This is because ggplot() counts the number of observations in each bin for you.

Within the geom_histogram() call, you’ll need to tell R what number of bins you want using the bins argument. In this case (using the Rice Rule to determine the appropriate number of bins), we’ll use 10 bins.

# base layer: ggplot
ggplot(data = production_clean,
       # x-axis should be metric_tons_mil
       # fill by production type (aquaculture or capture)
       mapping = aes(x = metric_tons_mil,
                     fill = type)) + 
  # first layer: histogram
  geom_histogram(bins = 10, # set the number of bins based on Rice Rule
                 color = "black") # make the border of the columns black

Can you tell from looking at the histogram which production type tends to produce more fish? Why or why not?

Yes. There are no observations for aquaculture at high catch (in millions); most of the observations for aquaculture are lower than 1 million tons, while capture production ranges up to 6 million tons.

5. Visualizing spread, uncertainty, and confidence

a. Calculations

# create a new object from clean data frame
production_summary <- production_clean |> 
  # group by production type
  group_by(type) |> 
  # calculate values for each production type
  summarize(
    mean = mean(metric_tons_mil), # mean
    n = length(metric_tons_mil), # number of observations (n)
    df = n - 1, # degrees of freedom
    sd = sd(metric_tons_mil), # standard deviation
    se = sd/sqrt(n), # standard error
    tval = qt(p = 0.05/2, df = df, lower.tail = FALSE), # t value
    margin = tval*se, # margin of error
    ci_lower = mean - margin, # lower bound of the confidence interval
    ci_higher = mean + margin # upper bound of the confidence interval
  ) 

# display object
production_summary

# A tibble: 2 × 10
  type         mean     n    df    sd     se  tval margin ci_lower ci_higher
  <chr>       <dbl> <int> <dbl> <dbl>  <dbl> <dbl>  <dbl>    <dbl>     <dbl>
1 aquaculture 0.324    59    58 0.149 0.0194  2.00 0.0389    0.285     0.363
2 capture     4.38     59    58 1.20  0.156   2.00 0.313     4.07      4.69

How could you write about these confidence intervals?

Mean production in US aquaculture is 0.32 million metric tons [95% CI: 0.29, 0.36], while mean capture production is 4.38 million metric tons [95% CI: 4.07, 4.69].

b. Visualizations

When visualizing the central tendency (in this case, mean) and spread (standard deviation) or uncertainty (standard error) or confidence (confidence intervals), you can stack geoms on top of each other.

To visualize the mean, we’ll use geom_point(). Remember that geom_point() can be used for any plot you want to make that involves a point.

To visualize the spread/uncertainty/confidence, we’ll use geom_errorbar(). This is the geom that creates two lines that can extend away from a point.

Standard deviation

First, we’ll visualize standard deviation.

# base layer: ggplot
ggplot(data = production_summary, 
       # x-axis should be production type
       # y-axis is mean production
       # color by production type
       mapping = aes(x = type, 
                     y = mean, 
                     color = type)) + 
  # first layer: points to represent mean
  geom_point(size = 2) + 
  # second layer: bars to represent standard deviation
  geom_errorbar(mapping = aes(ymin = mean - sd, 
                              ymax = mean + sd),
                width = 0.1) + # make the bars narrower
  # relabelling x- and y-axes, adding title
  labs(title = "Standard deviation",
       x = "Type",
       y = "Mean and SD million metric tons production")

Standard error

Then, we want to visualize standard error.

# base layer: ggplot
ggplot(data = production_summary, 
       # x-axis should be production type
       # y-axis is mean production
       # color by production type
       mapping = aes(x = type, 
                     y = mean, 
                     color = type)) + 
  # first layer: points to represent mean
  geom_point(size = 2) + 
  # second layer: bars to represent standard error
  geom_errorbar(mapping = aes(ymin = mean - se, 
                              ymax = mean + se),
                width = 0.1) + # make the bars narrower
  # relabelling x- and y-axes, adding title
  labs(title = "Standard error",
       x = "Type",
       y = "Mean and SE million metric tons production")

95% confidence interval

Then, we want to visualize the 95% confidence interval.

# base layer: ggplot
ggplot(data = production_summary, 
       # x-axis should be production type
       # y-axis is mean production
       # color by production type
       mapping = aes(x = type, 
                     y = mean, 
                     color = type)) + 
  # first layer: points to represent mean
  geom_point(size = 2) + 
  # second layer: bars to represent 95% CI
  geom_errorbar(mapping = aes(ymin = ci_lower, 
                              ymax = ci_higher),
                # make the bars narrower
                width = 0.1) + 
  # relabelling x- and y-axes, adding title
  labs(title = "Confidence interval using geom_point and geom_errorbar",
       x = "Type",
       y = "Mean and 95% CI million metric tons production")

c. `geom_pointrange()`

We can also visualize means and spread/uncertainty/confidence intervals using the geom_pointrange() function.

# base layer: ggplot
ggplot(data = production_summary, 
       # x-axis should be production type
       # y-axis is mean production
       # color by production type
       mapping = aes(x = type, 
                     y = mean, 
                     color = type)) +
  # first layer: points representing mean,
  geom_pointrange(mapping = aes(ymin = ci_lower, 
                                ymax = ci_higher)) +
  # relabelling x- and y-axes, adding title
  labs(title = "Confidence interval (using geom_pointrange)",
       x = "Type",
       y = "Mean and 95% CI million metric tons production")

d. Visualizing with the underlying data

Lastly, we want to visualize the 95% confidence interval with the underlying data.

# base layer: ggplot
ggplot(data = production_clean,
       mapping = aes(x = type, 
                     y = metric_tons_mil, 
                     color = type)) +
  # first layer: adding data (each point shows an observation)
  geom_jitter(width = 0.1,
              height = 0,
              alpha = 0.4,
              shape = 21) +
  # second layer: means and 95% CI
  geom_pointrange(data = production_summary,
                  mapping = aes(x = type, 
                                y = mean, 
                                ymin = ci_lower, 
                                ymax = ci_higher)) +
  # custom colors
  scale_color_manual(values = c("aquaculture" = "deepskyblue",
                                "capture" = "orangered")) +
  # relabelling x- and y-axes, adding title
  labs(x = "Production type",
       y = "Million metric tons of production",
       title = "Capture produces more than aquaculture") +
  # custom theme
  theme_light() +
  # taking out legend
  theme(legend.position = "none")

ggplot themes

There are lots of themes in ggplot to play around with. These are nice to use to get rid of the grey background that is the default, and generally make your plot look cleaner.

A list of built-in themes and their theme_() function calls is here.

e. Visualizing through time

Then, if we want to visualize production through time:

# base layer: ggplot
ggplot(data = production_clean,
       # x-axis should be year, y-axis should be metric tons in millions
       # color and shape by production type
       mapping = aes(x = year,
                     y = metric_tons_mil,
                     color = type,
                     shape = type)) +
  # first layer: points
  geom_point() +
  # second layer: lines
  geom_line() +
  # manually set colors
  scale_color_manual(values = c("aquaculture" = "deepskyblue",
                                "capture" = "orangered")) +
  # relabel x- and y-axes, legend title
  labs(x = "Year",
       y = "Million metric tons of production",
       color = "Production type",
       shape = "Production type") +
  # custom theme
  theme_minimal()

END OF WORKSHOP 2

3. Extra stuff

Shapes in figures

In class, we used shape = 21 in the geom_point() call to make the points show up as open circles. By default, ggplot() uses shape = 16 for all geometries that include points.

This figure below shows the 26 options for shapes you can use in any plot with a point geometry.

Shapes 0-14 are only outlines (with a transparent fill). Shapes 15 - 20 are filled (no outline). This means you control them with color in the aes() function, and scale_color_() functions. These show up in pink in the plot below.

Shapes 21 - 25 include outlines and fills. You can manipulate both: you can change the outline using color and scale_color_() functions, and change the fill with fill and scale_fill_() functions. In the plot, outlines show up in pink and fills show up in yellow.

Credit to Albert Kuo’s blog post for inspiring me to make my own reference figure, and Alex Phillips’s colorblind friendly schemes for the colors in the figure.

Overlapping histograms

In class, we created a histogram that combined the distributions of metric tons of production for two groups: capture and aquaculture production.

I got a question about what a histogram would look like if there was overlap in the distributions of groups, and wanted to demonstrate visually what that would look like.

Here, I’m generating a data frame with fake values from distributions that have different means but overlapping spread:

set.seed(1)

# creating a data frame
test_df <- tibble(
  # generating fake values
  a = rnorm(n = 50, mean = 30, sd = 10),
  b = rnorm(n = 50, mean = 40, sd = 10)
) |> 
  # making data frame longer
  pivot_longer(
    cols = a:b,
    names_to = "group",
    values_to = "value"
  )

Then, we can compare the histograms if they are stacked (on the left) and separate (on the right):

Stacked histograms:

# base layer: ggplot
ggplot(data = test_df,
       # only supplying x-axis value
       # filling by group
       mapping = aes(x = value,
                     fill = group)) +
  # first layer: histogram
  geom_histogram(bins = 9,
                 color = "black") +
  # creating different panels for each group
  facet_wrap(~ group) +
  # edits to y-axis to get rid of space between bottom of columns and x-axis
  # changing y-axis limits
  # making y-axis breaks smaller to make it easier to see frequencies
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 36),
                     breaks = seq(from = 0, to = 36, by = 3))

Separate histograms:

# base layer: ggplot
ggplot(data = test_df,
       # only supplying x-axis value
       # filling by group
       mapping = aes(x = value,
                     fill = group)) +
  # first layer: histogram
  geom_histogram(bins = 9,
                 color = "black") +
  # creating different panels for each group
  # facet_wrap(~ group) +
  # edits to y-axis to get rid of space between bottom of columns and x-axis
  # changing y-axis limits
  # making y-axis breaks smaller to make it easier to see frequencies
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 36),
                     breaks = seq(from = 0, to = 36, by = 3))

If the distributions overlap, then the counts stack on top of each other.

--- title: "Coding workshop: Week 2" subtitle: "using Quarto, data wrangling, visualizing uncertainty" date: 2026-04-10 categories: [tidyverse, janitor, read_csv, pipe operators, '|>', clean_names, mutate, select, pivot_longer, group_by, summarize, sd, qt, ggplot, geom_histogram, geom_point, geom_errorbar, geom_pointrange, geom_line] --- [Workshop date: Friday 10 April]{style="color: #79ACBD; font-size: 24px;"} ## 1. Summary ### Packages - `tidyverse` - `janitor` ### Operations - read in data using `read_csv()` - chain functions together using ` |> ` - clean column names using `clean_names()` - create new columns using `mutate()` - select columns using `select()` - make data frame longer using `pivot_longer()` - group data using `group_by()` - undo grouping using `ungroup()` - summarize data using `summarize()` - calculate standard deviation using `sd()` - calculate t-values using `qt()` - visualize data using `ggplot()` - create histograms using `geom_histogram()` - visualize means and raw data using `geom_point()` - visualize standard deviation, standard error, and confidence intervals using `geom_errorbar()` and `geom_pointrange()` - visualize trends through time using `geom_point()` and `geom_line()` ### Data source This week, we'll work with data on seafood production types (aquaculture or capture). This workshop's data comes from [Tidy Tuesday 2021-10-12](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-10-12/readme.md), which was from [OurWorldinData.org](https://ourworldindata.org/fish-and-overfishing). ## 2. Code ### 1. Set up This section of code includes reading in the packages you'll need: `tidyverse` and `janitor`. You'll also read in the data using `read_csv()` and store the data in an object called `production`. ```{r libraries} # load in packages library(tidyverse) library(janitor) ``` ```{r student-data} #| eval: false # read in data production <- read_csv("captured_vs_farmed.csv") ``` ```{r website-data} #| echo: false production <- read_csv(here::here("workshop", "data", "captured_vs_farmed.csv")) ``` :::{.callout-note} Remember to _look_ at the data before working with it! You can use `View(production)` in the console, or click on the `production` object in the Environment tab in the top right. ::: **Before you start:** think about what the differences might be between aquaculture and capture production in the context of fisheries. Which production type do you think would produce _more_ seafood? **insert your best guess here** ### 2. Cleaning up The data comes in what's called "wide format", meaning that each row represents multiple observations. For example, the first row contains the production from Afghanistan (country code AFG) in 1969 for aquaculture and capture production. We want to convert the data into "long format" so that it's easier to work with. A dataset is in long format if each row represents an observation. In this chunk of code, we'll: 1. clean the column names using `clean_names()` 2. filter to only include the "entity" we want using `filter()` 3. select the columns of interest using `select()` 4. make the data frame longer using `pivot_longer()` 5. manipulate the `type` column to change the long names (e.g. `aquaculture_production_metric_tons`) to short names (e.g. `aquaculture`) using `mutate()` and `case_when()` 6. use `mutate()` to create a new column called `metric_tons_mil` ```{r creating-clean-dataframe} # creating new object from production production_clean <- production |> # clean column names clean_names() |> # filter to only include observations from the US filter(entity == "United States") |> # select columns of interest select(year, aquaculture_production_metric_tons, capture_fisheries_production_metric_tons) |> # choose columns to pivot pivot_longer(cols = aquaculture_production_metric_tons: capture_fisheries_production_metric_tons, # column with fishery type should be named "type" names_to = "type", # column with catch amount should be named "catch_metric_tons" values_to = "catch_metric_tons") |> # mutate the existing type column mutate(type = case_when( # fill in aquaculture to replace aquaculture_production_metric_tons type == "aquaculture_production_metric_tons" ~ "aquaculture", # fill in capture to replace capture_fisheries_production_metric_tons type == "capture_fisheries_production_metric_tons" ~ "capture" )) |> # convert catch in metric tons to millions mutate(metric_tons_mil = catch_metric_tons/1000000) ``` ### 3. Making a boxplot/jitter plot Last week in workshop, we made a boxplot. The boxplot shows useful summary statistics (displays the central tendency and spread), while the jitter plot shows the actual observations (in this case, each point is the catch for a fishery in a given year). One way to display the underlying data is to combine a boxplot with a jitter plot. ```{r} #| label: boxplot-and-jitter # start with the production_clean data frame ggplot(data = production_clean, # x-axis should be type of production # y-axis should be metric tons of production (in millions) # coloring by production type mapping = aes(x = type, y = metric_tons_mil, color = type)) + # first layer: boxplot geom_boxplot() + # second layer: jitter geom_jitter(width = 0.2, # making the points jitter horizontally height = 0) + # making sure points don't jitter vertically # relabelling x- and y-axes labs(x = "Type", y = "Metric tons of production (in millions)") ``` On average, a) which type of production produces more fish, and b) what components of the plot are you using to come up with your answer? **a) Capture production, because b) the median of the boxplot is way higher than the median for aquaculture** ### 4. Making a histogram This chunk of code creates a histogram. Note that for a histogram, you only need to fill in the `aes()` argument for the x-axis (`x`), not the y-axis. This is because `ggplot()` counts the number of observations in each bin for you. Within the `geom_histogram()` call, you'll need to tell R what number of bins you want using the `bins` argument. In this case (using the Rice Rule to determine the appropriate number of bins), we'll use 10 bins. ```{r creating-histogram} # base layer: ggplot ggplot(data = production_clean, # x-axis should be metric_tons_mil # fill by production type (aquaculture or capture) mapping = aes(x = metric_tons_mil, fill = type)) + # first layer: histogram geom_histogram(bins = 10, # set the number of bins based on Rice Rule color = "black") # make the border of the columns black ``` Can you tell from looking at the histogram which production type tends to produce more fish? Why or why not? **Yes. There are no observations for aquaculture at high catch (in millions); most of the observations for aquaculture are lower than 1 million tons, while capture production ranges up to 6 million tons.** ### 5. Visualizing spread, uncertainty, and confidence #### a. Calculations ```{r calculations} # create a new object from clean data frame production_summary <- production_clean |> # group by production type group_by(type) |> # calculate values for each production type summarize( mean = mean(metric_tons_mil), # mean n = length(metric_tons_mil), # number of observations (n) df = n - 1, # degrees of freedom sd = sd(metric_tons_mil), # standard deviation se = sd/sqrt(n), # standard error tval = qt(p = 0.05/2, df = df, lower.tail = FALSE), # t value margin = tval*se, # margin of error ci_lower = mean - margin, # lower bound of the confidence interval ci_higher = mean + margin # upper bound of the confidence interval ) # display object production_summary ``` How could you write about these confidence intervals? **Mean production in US aquaculture is 0.32 million metric tons [95% CI: 0.29, 0.36], while mean capture production is 4.38 million metric tons [95% CI: 4.07, 4.69].** #### b. Visualizations When visualizing the central tendency (in this case, mean) and spread (standard deviation) or uncertainty (standard error) or confidence (confidence intervals), you can stack geoms on top of each other. To visualize the mean, we'll use `geom_point()`. Remember that `geom_point()` can be used for any plot you want to make that involves a point. To visualize the spread/uncertainty/confidence, we'll use `geom_errorbar()`. This is the geom that creates two lines that can extend away from a point. ##### Standard deviation First, we'll visualize standard deviation. ```{r sd-plot} # base layer: ggplot ggplot(data = production_summary, # x-axis should be production type # y-axis is mean production # color by production type mapping = aes(x = type, y = mean, color = type)) + # first layer: points to represent mean geom_point(size = 2) + # second layer: bars to represent standard deviation geom_errorbar(mapping = aes(ymin = mean - sd, ymax = mean + sd), width = 0.1) + # make the bars narrower # relabelling x- and y-axes, adding title labs(title = "Standard deviation", x = "Type", y = "Mean and SD million metric tons production") ``` ##### Standard error Then, we want to visualize standard error. ```{r se-plot} # base layer: ggplot ggplot(data = production_summary, # x-axis should be production type # y-axis is mean production # color by production type mapping = aes(x = type, y = mean, color = type)) + # first layer: points to represent mean geom_point(size = 2) + # second layer: bars to represent standard error geom_errorbar(mapping = aes(ymin = mean - se, ymax = mean + se), width = 0.1) + # make the bars narrower # relabelling x- and y-axes, adding title labs(title = "Standard error", x = "Type", y = "Mean and SE million metric tons production") ``` ##### 95% confidence interval Then, we want to visualize the 95% confidence interval. ```{r ci-95-plot} # base layer: ggplot ggplot(data = production_summary, # x-axis should be production type # y-axis is mean production # color by production type mapping = aes(x = type, y = mean, color = type)) + # first layer: points to represent mean geom_point(size = 2) + # second layer: bars to represent 95% CI geom_errorbar(mapping = aes(ymin = ci_lower, ymax = ci_higher), # make the bars narrower width = 0.1) + # relabelling x- and y-axes, adding title labs(title = "Confidence interval using geom_point and geom_errorbar", x = "Type", y = "Mean and 95% CI million metric tons production") ``` #### c. `geom_pointrange()` We can also visualize means and spread/uncertainty/confidence intervals using the `geom_pointrange()` function. ```{r ci-95-plot-with-pointrange} # base layer: ggplot ggplot(data = production_summary, # x-axis should be production type # y-axis is mean production # color by production type mapping = aes(x = type, y = mean, color = type)) + # first layer: points representing mean, geom_pointrange(mapping = aes(ymin = ci_lower, ymax = ci_higher)) + # relabelling x- and y-axes, adding title labs(title = "Confidence interval (using geom_pointrange)", x = "Type", y = "Mean and 95% CI million metric tons production") ``` #### d. Visualizing with the underlying data Lastly, we want to visualize the 95% confidence interval with the underlying data. ```{r final-plot} # base layer: ggplot ggplot(data = production_clean, mapping = aes(x = type, y = metric_tons_mil, color = type)) + # first layer: adding data (each point shows an observation) geom_jitter(width = 0.1, height = 0, alpha = 0.4, shape = 21) + # second layer: means and 95% CI geom_pointrange(data = production_summary, mapping = aes(x = type, y = mean, ymin = ci_lower, ymax = ci_higher)) + # custom colors scale_color_manual(values = c("aquaculture" = "deepskyblue", "capture" = "orangered")) + # relabelling x- and y-axes, adding title labs(x = "Production type", y = "Million metric tons of production", title = "Capture produces more than aquaculture") + # custom theme theme_light() + # taking out legend theme(legend.position = "none") ``` :::{.callout-note title="ggplot themes" collapse="true"} There are lots of themes in ggplot to play around with. These are nice to use to get rid of the grey background that is the default, and generally make your plot look cleaner. A list of built-in themes and their `theme_()` function calls is [here](https://ggplot2-book.org/themes#sec-themes). ::: #### e. Visualizing through time Then, if we want to visualize production through time: ```{r time-plot} # base layer: ggplot ggplot(data = production_clean, # x-axis should be year, y-axis should be metric tons in millions # color and shape by production type mapping = aes(x = year, y = metric_tons_mil, color = type, shape = type)) + # first layer: points geom_point() + # second layer: lines geom_line() + # manually set colors scale_color_manual(values = c("aquaculture" = "deepskyblue", "capture" = "orangered")) + # relabel x- and y-axes, legend title labs(x = "Year", y = "Million metric tons of production", color = "Production type", shape = "Production type") + # custom theme theme_minimal() ``` **END OF WORKSHOP 2** ## 3. Extra stuff ### Shapes in figures In class, we used `shape = 21` in the `geom_point()` call to make the points show up as open circles. By default, `ggplot()` uses `shape = 16` for all geometries that include points. This figure below shows the 26 options for shapes you can use in any plot with a point geometry. Shapes 0-14 are only outlines (with a transparent fill). Shapes 15 - 20 are filled (no outline). This means you control them with `color` in the `aes()` function, and `scale_color_()` functions. These show up in [pink]{style="color: #db0f48; font-weight: bold"} in the plot below. Shapes 21 - 25 include outlines and fills. You can manipulate both: you can change the outline using `color` and `scale_color_()` functions, and change the fill with `fill` and `scale_fill_()` functions. In the plot, outlines show up in [pink]{style="color: #db0f48; font-weight: bold"} and fills show up in [yellow]{style="color: #f4b301; font-weight: bold"}. ```{r shapes} #| fig-height: 5 #| fig-width: 7 #| echo: false df <- tibble( point_numbers = seq(from = 0, to = 25, by = 1) ) |> mutate(type = case_when( between(point_numbers, 0, 14) ~ "Outline", between(point_numbers, 15, 20) ~ "Fill", between(point_numbers, 21, 25) ~ "Outline + Fill" ), type = fct_relevel(type, "Outline", "Fill", "Outline + Fill")) |> mutate(x_axis = case_when( point_numbers %in% c(0, 6, 12, 15, 21) ~ "a", point_numbers %in% c(1, 7, 13, 16, 22) ~ "b", point_numbers %in% c(2, 8, 14, 17, 23) ~ "c", point_numbers %in% c(3, 9, 18, 24) ~ "d", point_numbers %in% c(4, 10, 19, 25) ~ "e", point_numbers %in% c(5, 11, 20) ~ "f" )) |> mutate(y_axis = case_when( point_numbers %in% c(0, 1, 2, 3, 4, 5) ~ 5, point_numbers %in% c(6, 7, 8, 9, 10, 11) ~ 4, point_numbers %in% c(11, 12, 13, 14) ~ 3, point_numbers %in% c(15, 16, 17, 18, 19, 20) ~ 2, point_numbers %in% c(21, 22, 23, 24, 25) ~ 1 )) |> mutate(y_axis_labels = y_axis - 0.2) values <- seq(from = 0, to = 25, by = 1) yellow <- "#f4b301" pink <- "#db0f48" ggplot(data = df, aes(x = x_axis, y = y_axis, shape = as_factor(point_numbers), color = type, fill = type)) + geom_point(size = 6, stroke = 0.7) + geom_text(aes(label = point_numbers, y = y_axis_labels), color = "black") + scale_shape_manual(values = values) + scale_color_manual(values = c(pink, pink, pink)) + scale_fill_manual(values = c(yellow, yellow, yellow)) + scale_y_continuous(labels = c( "Outline + Fill \n(control with `color` or `fill`)", "Fill only \n(control with `color`)", "", "Outline only \n (control with `color`)", "" ), breaks = c(1, 2, 3, 4, 5)) + theme(legend.position = "none", strip.text.y.left = element_text(angle = 360), strip.background = element_blank(), panel.background = element_blank(), axis.ticks = element_blank(), axis.title = element_blank(), axis.text.x = element_blank(), axis.text.y = element_text(hjust = 0.5, vjust = 0.5, size = 12)) ``` Credit to [Albert Kuo's blog post](https://blog.albertkuo.me/post/point-shape-options-in-ggplot/) for inspiring me to make my own reference figure, and Alex Phillips's [colorblind friendly schemes](https://www.nceas.ucsb.edu/sites/default/files/2022-06/Colorblind%20Safe%20Color%20Schemes.pdf) for the colors in the figure. ### Overlapping histograms In class, we created a histogram that combined the distributions of metric tons of production for two groups: capture and aquaculture production. I got a question about what a histogram would look like if there was overlap in the distributions of groups, and wanted to demonstrate visually what that would look like. Here, I'm generating a data frame with fake values from distributions that have different means but overlapping spread: ```{r} set.seed(1) # creating a data frame test_df <- tibble( # generating fake values a = rnorm(n = 50, mean = 30, sd = 10), b = rnorm(n = 50, mean = 40, sd = 10) ) |> # making data frame longer pivot_longer( cols = a:b, names_to = "group", values_to = "value" ) ``` ```{css, echo=FALSE, eval=TRUE} .multiple .leftcolumn { width: 40%; margin-top: 0px; margin-right: 50px; margin-bottom: 0px; margin-left: 250px } .multiple .rightcolumn { width:40%; margin-top: 0px; margin-right: 250px; margin-bottom: 0px; margin-left: 50px } .multiple .columns { display: flex; } @media (max-width: 767px) { .multiple .column { width: auto; } .multiple .columns { display: flex; flex-direction: column; } } ``` Then, we can compare the histograms if they are stacked (on the left) and separate (on the right): ::::: { .multiple} :::: {.columns .column-screen-inset} ::: leftcolumn Stacked histograms: ```{r} # base layer: ggplot ggplot(data = test_df, # only supplying x-axis value # filling by group mapping = aes(x = value, fill = group)) + # first layer: histogram geom_histogram(bins = 9, color = "black") + # creating different panels for each group facet_wrap(~ group) + # edits to y-axis to get rid of space between bottom of columns and x-axis # changing y-axis limits # making y-axis breaks smaller to make it easier to see frequencies scale_y_continuous(expand = c(0, 0), limits = c(0, 36), breaks = seq(from = 0, to = 36, by = 3)) ``` ::: ::: rightcolumn Separate histograms: ```{r} # base layer: ggplot ggplot(data = test_df, # only supplying x-axis value # filling by group mapping = aes(x = value, fill = group)) + # first layer: histogram geom_histogram(bins = 9, color = "black") + # creating different panels for each group # facet_wrap(~ group) + # edits to y-axis to get rid of space between bottom of columns and x-axis # changing y-axis limits # making y-axis breaks smaller to make it easier to see frequencies scale_y_continuous(expand = c(0, 0), limits = c(0, 36), breaks = seq(from = 0, to = 36, by = 3)) ``` ::: :::: ::::: If the distributions overlap, then the counts stack on top of each other.

1. Summary

Packages

Operations

Data source

2. Code

1. Set up

2. Cleaning up

3. Making a boxplot/jitter plot

4. Making a histogram

5. Visualizing spread, uncertainty, and confidence

a. Calculations

b. Visualizations

Standard deviation

Standard error

95% confidence interval

c. geom_pointrange()

d. Visualizing with the underlying data

e. Visualizing through time

3. Extra stuff

Shapes in figures

Overlapping histograms

c. `geom_pointrange()`