DATA 505: Statistics Using R

Week 5: Data manipulation & visualization

Prof. Yi Lu

2025-10-28

Prep work

The following packages are required for today’s lecture. Please install them first if needed.

library(tidyverse) # for data manipulation; ggplot2 is part of this
library(maps)
library(plotly) # for interactive plots

To start, we’ll use a very simple dataset that ships with base R.

?USArrests
str(USArrests)

Create a separate variable for state names; create a categorical (ordinal) variable based on Urban Population. More on this syntax later.

USArrests <- USArrests |> 
  mutate(State = row.names(USArrests)) |>
  mutate(UrbanPop_tile = ntile(UrbanPop, 3)) |>
  mutate(UrbanPop_Cat = factor(UrbanPop_tile, 
                               levels=c(1,2,3),
                               labels=c('Low',"Medium","High"),
                               ordered=TRUE))
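
To see what ntile() and factor() are doing in the code above, here is a minimal sketch on a small made-up vector. The values in x are hypothetical and chosen only for illustration; they are not part of USArrests.

```r
library(dplyr)

# Hypothetical toy values, not part of USArrests
x <- c(10, 20, 30, 40, 50, 60)

# ntile() ranks the values and splits them into 3 roughly equal-sized groups
tiles <- ntile(x, 3)
tiles

# factor(..., ordered = TRUE) attaches labels and an ordering to the groups
cat3 <- factor(tiles,
               levels = c(1, 2, 3),
               labels = c("Low", "Medium", "High"),
               ordered = TRUE)
cat3

# Because the factor is ordered, comparisons between levels are meaningful
cat3 < "High"
```

An ordered factor also keeps Low, Medium, High in that order on plot axes and legends, rather than alphabetical order.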

Base R plots

Histogram

  • A histogram is a univariate graph. When describing a distribution, keep in mind its location, spread, and shape.
  • Visual elements can be modified by adding arguments within the function call
hist(USArrests$Murder,
     breaks=30,
     main='Murder rate per 100,000',
     col='pink',
     ylab=NA,
     yaxt='n',
     xlab=NA)

Boxplot

  • Boxplot is based on the 5-number summary: min, Q1, Q2 (median), Q3, max
boxplot(USArrests$Murder)
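
The numbers behind the boxplot can be inspected directly. The sketch below uses only base R functions; note that R's boxplot uses Tukey's hinges, which can differ slightly from the quartiles reported by quantile(), and that the whiskers stop at the most extreme points within 1.5 × IQR of the box rather than always reaching the min and max.

```r
murder <- USArrests$Murder

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(murder)
five

# quantile() gives min, Q1, median, Q3, max; the quartiles can differ slightly
# from the hinges above, depending on the quantile definition used
quantile(murder)

# The five values boxplot() actually draws; the whisker ends are pulled in
# to the most extreme points within 1.5 * IQR of the box
stats <- boxplot.stats(murder)$stats
stats
```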

Stem and leaf plot

  • Different software often produces stem-and-leaf plots that look different
  • In R, the scale and width arguments of stem() change the layout
  • Only occasionally useful
stem(USArrests$Murder)

  The decimal point is at the |

   0 | 8
   2 | 11226672348
   4 | 0349379
   6 | 003682349
   8 | 158007
  10 | 04134
  12 | 127022
  14 | 444
  16 | 14
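
As a quick illustration of the scale and width arguments mentioned above, the following sketch redraws the same plot with different settings. The particular values 2 and 40 are arbitrary choices for demonstration.

```r
# scale = 2 makes the display roughly twice as long (more, narrower stems)
stem(USArrests$Murder, scale = 2)

# width sets the desired width of the printed plot (the default is 80)
stem(USArrests$Murder, width = 40)
```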

Side-by-side boxplots

boxplot(USArrests$Assault ~ USArrests$UrbanPop_Cat,
        col='pink',
        border='darkblue',
        pch=16,
        xlab='Urban population',
        ylab='Assault rate')

Scatterplot

  • Scatterplots are bivariate; they are useful for detecting a relationship between two numeric (scale) variables
plot(USArrests$UrbanPop,USArrests$Assault)

ggplot2

Example 1

  • ggplot2 is an alternative, more unified framework (Grammar of Graphics) for creating plots and graphics in R
ggplot(data=USArrests,
       mapping=aes(x=UrbanPop,y=Assault)) + 
  geom_point()

Example 2

ggplot(data=USArrests,
       mapping=aes(x=UrbanPop_Cat,y = Assault, fill=UrbanPop_Cat)) + 
  geom_boxplot()

Basic ingredients of a ggplot

  1. Data
    • Data frame containing all the raw observations
  2. Aesthetic mapping
    • An aesthetic is a visual property of the plot, such as x-axis, y-axis, color, fill, size, etc.
    • Use aes to specify which variable is mapped to which aesthetic
    • Example: height -> x-axis; weight -> y-axis; gender -> fill; name -> text
  3. Geoms
    • What types of layers should be added to the plots
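
The three ingredients can be sketched with the height/weight example from the list above. The people data frame and its values are made up for illustration. In ggplot2 the text drawn by geom_label() comes from the label aesthetic, so name is mapped to label here.

```r
library(ggplot2)

# Hypothetical data frame, made up for illustration
people <- data.frame(
  name   = c("Ana", "Ben", "Cy", "Dee"),
  height = c(160, 175, 182, 168),
  weight = c(55, 72, 85, 63),
  gender = c("F", "M", "M", "F")
)

# 1. Data: people
# 2. Aesthetic mapping: height -> x, weight -> y, gender -> fill, name -> label
# 3. Geom: geom_label() draws each name in a filled box at (height, weight)
p <- ggplot(data = people,
            mapping = aes(x = height, y = weight, fill = gender, label = name)) +
  geom_label()
p
```

Swapping geom_label() for another geom keeps the data and the mapping unchanged, which is the main appeal of this layered structure.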

Example 1 explained

  • UrbanPop maps to the x-axis
  • Assault maps to the y-axis
  • geom_point draws scatterplot-type points for each observation
ggplot(data=USArrests,
       mapping=aes(x=UrbanPop,y=Assault)) + 
  geom_point()

Example 2 explained

  • UrbanPop_Cat maps to the x-axis AND fill (this kind of redundant mapping should be avoided!)
  • Assault maps to the y-axis
  • geom_boxplot draws boxplots
  • Try: change geom_boxplot to geom_violin
ggplot(data=USArrests,
       mapping=aes(
        x=UrbanPop_Cat,
        y = Assault,
        fill=UrbanPop_Cat
       )) + 
  geom_boxplot()

geom_histogram

  • Recall: a histogram is univariate; the y-axis shows the frequency (count) in each bin, which is computed from x, so we don’t need a y=_ mapping
ggplot(data = USArrests,  aes(x = Murder)) + 
  geom_histogram()

  • Can add arguments in geoms
ggplot(data = USArrests,  aes(x = Murder)) + 
  geom_histogram(binwidth=5,
                 fill="orange",
                 color="blue")
  • Note the differences in argument names:
    • base R col vs ggplot fill (interior of the bars)
    • base R border vs ggplot color (outline of the bars)
    • in particular, “color” (ggplot) does not mean the same thing as “col” (base R)
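
To make the naming difference concrete, here is a side-by-side sketch that draws the same orange bars with blue outlines in both systems. It only reuses arguments already shown in this section.

```r
library(ggplot2)

# Base R: col fills the bars, border colors their outline
hist(USArrests$Murder, col = "orange", border = "blue")

# ggplot2: fill is the interior of the bars, color is their outline
p <- ggplot(data = USArrests, aes(x = Murder)) +
  geom_histogram(binwidth = 5, fill = "orange", color = "blue")
p
```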

  • Add another dimension (i.e., map another variable to another aesthetic)
ggplot(data = USArrests, aes(x = Murder, fill=UrbanPop_Cat)) + 
  geom_histogram(binwidth=5, color="blue")

geom_bar

  • The bar graph shows that each category has roughly the same number of states, as expected.
ggplot(data=USArrests, 
       mapping=aes(x=UrbanPop_Cat)) +
  geom_bar()
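
The counts behind the bars can be checked with count(). This sketch rebuilds the three-level category from the raw dataset so that it runs on its own; arrests and cat_counts are names introduced here for illustration.

```r
library(dplyr)

# Rebuild the three-level category so this snippet runs on its own
arrests <- USArrests |>
  mutate(UrbanPop_Cat = factor(ntile(UrbanPop, 3),
                               levels = c(1, 2, 3),
                               labels = c("Low", "Medium", "High"),
                               ordered = TRUE))

# count() returns one row per category with the number of states in column n,
# which is what geom_bar() computes and draws as bar heights
cat_counts <- arrests |> count(UrbanPop_Cat)
cat_counts
```

With 50 states split into 3 groups, the groups cannot be exactly equal, which is why the bars are only roughly the same height.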

geom_text

ggplot(data=USArrests,
       aes(x=Murder,y=Assault,label=State)) + 
  geom_text()
  • Try: change geom_text to geom_label
  • Try: add geom_point

Customize ggplot with +

ggplot offers a very convenient syntax (+) to add elements and layers

ggplot(data=USArrests, 
       mapping=aes(x=UrbanPop_Cat)) +
  geom_bar() + 
  theme_bw() +
  ggtitle('Bar graph') +
  labs(x='Urban population percentage')

Example: change gradient color

ggplot(data=USArrests,
       aes(x=Murder,y=Assault,color=UrbanPop)) + 
  geom_point()+
  theme_bw() +
  ggtitle('Violent Crime Rates') +
  labs(x='Murder',y="Assault",color='Urban Pop Percentage') +
  scale_color_gradient(low="lightgreen",high="black") 

Example: change categorical color

ggplot(data=USArrests, 
       aes(x=Murder,fill=UrbanPop_Cat)) +
  geom_histogram() +
  scale_fill_manual(values=c("Low"='green',"Medium"='purple', "High"="darkgrey")) +
  theme_minimal()

There’s more!

  • There are many other geoms and customizations available!
  • Each geom has its own set of aesthetics to which variables can be mapped.
  • Bring up the help page to check what aesthetics are needed.
  • Many references can be found online - they all share the same, unified structure that you can now understand.
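
Besides reading the help page, the required aesthetics of a geom can be looked up from R itself. This is a small sketch, assuming the Geom objects exported by ggplot2 (GeomPoint, GeomText) and their required_aes field.

```r
library(ggplot2)

# Interactively, run ?geom_text and look for the "Aesthetics" section

# The required aesthetics are also stored on the underlying Geom objects
GeomPoint$required_aes
GeomText$required_aes
```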

Data manipulation: tidyverse

The Pipe operator

  • tidyverse packages: https://www.tidyverse.org/packages/

  • Powerful and popular for working with complicated datasets, typically arranged in a tidy format

  • Importantly, the pipe operator offers a new way of calling a function

  • These three lines of code are equivalent

summary(USArrests) 
USArrests %>% summary() # magrittr pipe (loaded with tidyverse)
USArrests |> summary() # base R pipe (R >= 4.1); parentheses are required
  • Functions in tidyverse are typically used with the pipe operator, but this is not required. For instance, these two lines are equivalent
USArrests |> filter(UrbanPop > 80) |> select(Murder, UrbanPop) 
select(filter(USArrests,UrbanPop > 80), Murder, UrbanPop)
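
One way to convince yourself that the piped and nested forms are the same is to look at how R parses them. The sketch below uses the base R pipe (R >= 4.1); same_call and same_result are names introduced here for illustration.

```r
# The base R pipe is rewritten by the parser into an ordinary function call,
# so quoting the piped expression shows the nested form
quote(USArrests |> summary())

# Both forms therefore give the same parsed expression and the same result
same_call <- identical(quote(USArrests |> summary()),
                       quote(summary(USArrests)))
same_result <- identical(USArrests |> summary(), summary(USArrests))
c(same_call, same_result)
```

The pipe passes the left-hand side as the first argument of the function on the right, which is why tidyverse functions take the data frame first.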

Useful tidyverse functions

# filter: subset data according to some condition
USArrests |> 
  filter(UrbanPop_Cat=='High', Murder > 10)

# slice: subset specific rows
# select: subset specific columns
USArrests |> 
  slice(1:5) |>
  select(Murder,Assault)

# arrange: sort by a variable
USArrests |> 
  arrange(desc(Murder)) |>
  slice(1:5)

# summarise: calculate summary statistics
USArrests |> 
  summarise(mean(Murder),median(Murder),count=length(Murder))


# mutate: create new variables
USArrests |> 
  mutate(Violent = Murder + Assault + Rape) |>
  arrange(desc(Violent)) |>
  select(Violent)

# group_by: look at subset groups of data
USArrests |> 
  group_by(UrbanPop_Cat) |>
  summarise(mean(Murder), median(Murder))
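
Naming the summaries makes the output easier to read and reuse, and n() adds the group size. This sketch rebuilds UrbanPop_Cat from the raw dataset so it runs on its own; the column names mean_murder, median_murder, and n_states are chosen here for illustration.

```r
library(dplyr)

# Rebuild the three-level category so this snippet runs on its own
arrests <- USArrests |>
  mutate(UrbanPop_Cat = factor(ntile(UrbanPop, 3),
                               levels = c(1, 2, 3),
                               labels = c("Low", "Medium", "High"),
                               ordered = TRUE))

# One row per group, with named summary columns and the group size
by_cat <- arrests |>
  group_by(UrbanPop_Cat) |>
  summarise(mean_murder = mean(Murder),
            median_murder = median(Murder),
            n_states = n())
by_cat
```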

Integration with ggplot

  • The pipe operator works very well with ggplot syntax
USArrests |> 
  mutate(Urban = ifelse(UrbanPop > 80, "Yes", "No")) |>
  ggplot(aes(x=Murder, y=Assault, color=Urban)) +
  geom_point() +
  theme_bw()

Combine data: map example

Map

  • There are a few different maps included in the maps package
# see ?map_data
mymap <- map_data(map="state")
  • What exactly is mymap? Try some basic visualizations…
plot(mymap$long,mymap$lat)

ggplot(data=mymap,aes(x=long,y=lat,group=group)) +
  geom_polygon() +
  coord_fixed(1.4) +
  theme_void()

Combine data

  • We want to combine the following datasets
  • The “State” column in USArrests should match the “region” column in mymap; the state names should have consistent spelling and capitalization
USArrests |> slice_head(n=3)
        Murder Assault UrbanPop Rape   State UrbanPop_tile UrbanPop_Cat
Alabama   13.2     236       58 21.2 Alabama             1          Low
Alaska    10.0     263       48 44.5  Alaska             1          Low
Arizona    8.1     294       80 31.0 Arizona             3         High
mymap |> slice_head(n=3)
       long      lat group order  region subregion
1 -87.46201 30.38968     1     1 alabama      <NA>
2 -87.48493 30.37249     1     2 alabama      <NA>
3 -87.52503 30.37249     1     3 alabama      <NA>

# convert state names to lower case
USArrests$State <- tolower(USArrests$State)

# combine data with left_join
mymap_comb <- left_join(mymap, USArrests,
                        by=c('region'='State'))

mymap_comb |> slice_head(n=3)
       long      lat group order  region subregion Murder Assault UrbanPop Rape
1 -87.46201 30.38968     1     1 alabama      <NA>   13.2     236       58 21.2
2 -87.48493 30.37249     1     2 alabama      <NA>   13.2     236       58 21.2
3 -87.52503 30.37249     1     3 alabama      <NA>   13.2     236       58 21.2
  UrbanPop_tile UrbanPop_Cat
1             1          Low
2             1          Low
3             1          Low
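
After a join it is worth checking which keys failed to match on either side. The following sketch uses anti_join() for that; it rebuilds the lowercase State column from the raw dataset so it runs on its own, and arrests, arrests_only, and map_only are names introduced here for illustration.

```r
library(dplyr)
library(ggplot2)  # map_data() lives in ggplot2 and needs the maps package installed

mymap <- map_data(map = "state")
arrests <- USArrests |>
  mutate(State = tolower(row.names(USArrests)))

# Rows of USArrests whose state has no matching region in the map
arrests_only <- anti_join(arrests, mymap, by = c("State" = "region"))
arrests_only$State

# Map regions with no matching state in USArrests;
# left_join(mymap, ...) keeps these polygons and fills their crime columns with NA
map_only <- mymap |>
  anti_join(arrests, by = c("region" = "State")) |>
  distinct(region)
map_only
```

Because left_join() keeps every row of mymap, unmatched map regions stay on the map with missing fill values, while unmatched USArrests rows are silently dropped.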

Visualization!

mymap_comb |>
  ggplot(aes(x=long,y=lat,group=group,
             fill=Murder)) +
  geom_polygon(color='darkblue') +
  coord_fixed(1.4) +
  theme_void() +
  labs(fill='Murder rate') +
  scale_fill_gradient(low="green", high="red")

Different ways to join data

  • Run ?left_join() to understand different ways to join data.
  • Discussion: what will the following code return? You can run the code to find out.
df.student
  name exam project
1    A   80       1
2    B   91       2
3    C   85       1
4    D   78      NA
df.project
  projectID    description
1         1  visualization
2         2 classification
3         3    text mining
inner_join(df.student, df.project, by=c('project' = 'projectID'))
left_join(df.student, df.project, by=c('project' = 'projectID'))
left_join(df.project, df.student, by=c('projectID' = 'project'))
right_join(df.student, df.project, by=c('project' = 'projectID'))
full_join(df.student, df.project, by=c('project' = 'projectID'))
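
To run the joins above, the two data frames first need to exist. The sketch below recreates them from the printed output and collects the results in a list so the row counts can be compared; the list name joins and its element names are introduced here for illustration.

```r
library(dplyr)

# Recreate the two small data frames printed above
df.student <- data.frame(
  name    = c("A", "B", "C", "D"),
  exam    = c(80, 91, 85, 78),
  project = c(1, 2, 1, NA)
)
df.project <- data.frame(
  projectID   = c(1, 2, 3),
  description = c("visualization", "classification", "text mining")
)

# Run each join once and keep the results together
joins <- list(
  inner        = inner_join(df.student, df.project, by = c("project" = "projectID")),
  left_student = left_join(df.student, df.project, by = c("project" = "projectID")),
  left_project = left_join(df.project, df.student, by = c("projectID" = "project")),
  right        = right_join(df.student, df.project, by = c("project" = "projectID")),
  full         = full_join(df.student, df.project, by = c("project" = "projectID"))
)

# Compare how many rows each join keeps, then inspect e.g. joins$full
sapply(joins, nrow)
```

When predicting the results, pay attention to student D (no project) and project 3 (no student): each join type treats these unmatched rows differently.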