DATA 505: Statistics Using R

Week 5: Data manipulation & visualization

Prof. Yi Lu

2025-10-28

Prep work

The following packages are required for today’s lecture. Install them first with install.packages() if needed.

library(tidyverse) # for data manipulation; ggplot2 is part of this
library(maps)
library(plotly) # for interactive plots

To start, we’ll use USArrests, a small dataset that ships with base R.

?USArrests
str(USArrests)

Create a separate variable for state names, then create a categorical (ordinal) variable from Urban Population by splitting it into three roughly equal-sized groups. More on this syntax later.

USArrests <- USArrests |>
  mutate(State = row.names(USArrests)) |>
  mutate(UrbanPop_tile = ntile(UrbanPop, 3)) |>
  mutate(UrbanPop_Cat = factor(UrbanPop_tile,
                               levels = c(1, 2, 3),
                               labels = c("Low", "Medium", "High"),
                               ordered = TRUE))
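
To see what ntile() did, a quick check can help. The sketch below works on a fresh copy of the data under the illustrative name arrests (a name not used elsewhere in these notes), so the USArrests object modified above is left alone.

```r
library(dplyr)

# Work on a fresh copy so the lecture's USArrests object is not touched
arrests <- datasets::USArrests |>
  mutate(UrbanPop_tile = ntile(UrbanPop, 3))

# ntile() ranks the values and cuts the ranks into 3 buckets of near-equal size
table(arrests$UrbanPop_tile)

# Range of UrbanPop within each bucket
arrests |>
  group_by(UrbanPop_tile) |>
  summarise(min_pop = min(UrbanPop), max_pop = max(UrbanPop))
```

Because ntile() splits by rank rather than by value, states with tied UrbanPop values can end up in different buckets.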

Base R plots

Histogram

A histogram is a univariate graph. When describing a distribution, keep in mind its location, spread, and shape.
Visual elements can be modified by adding arguments to the function call.

hist(USArrests$Murder,
     breaks = 30,   # suggested number of bins (nclass is a legacy alias)
     main = 'Murder rate per 100,000',
     col = 'pink',
     ylab = NA,
     yaxt = 'n',
     xlab = NA)

Boxplot

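A boxplot is based on the five-number summary: min, Q1, Q2 (median), Q3, max. In R, points more than 1.5 × IQR beyond the box are drawn individually as outliers, so the whiskers may stop short of the min or max.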

boxplot(USArrests$Murder)
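
The numbers behind the boxplot can be computed directly. This is a small sketch using base R only; murder, five, and bp are illustrative names.

```r
murder <- datasets::USArrests$Murder

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(murder)
five

# Quartiles via quantile(); Q1 and Q3 can differ slightly from the hinges
quantile(murder)

# The values boxplot() actually draws: whisker ends, box edges, median,
# plus any points flagged as outliers
bp <- boxplot.stats(murder)
bp$stats
bp$out
```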

Stem and leaf plot

Different software packages often produce stem-and-leaf plots that look different.
In R, the scale and width arguments of stem() change the layout.
Only sometimes useful.

stem(USArrests$Murder)


  The decimal point is at the |

   0 | 8
   2 | 11226672348
   4 | 0349379
   6 | 003682349
   8 | 158007
  10 | 04134
  12 | 127022
  14 | 444
  16 | 14
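
The effect of the two arguments is easiest to see by comparison. A minimal sketch; the values scale = 2 and width = 20 are arbitrary choices for illustration.

```r
murder <- datasets::USArrests$Murder

# Default display
stem(murder)

# scale controls the plot length: 2 makes it roughly twice as long
stem(murder, scale = 2)

# width sets the desired width of the plot in characters
stem(murder, width = 20)
```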

Side-by-side boxplots

boxplot(USArrests$Assault ~ USArrests$UrbanPop_Cat,
        col='pink',
        border='darkblue',
        pch=16,
        xlab='Urban population',
        ylab='Assault rate')

Scatterplot

Scatterplots are bivariate; they are useful for detecting a relationship between two numeric (scale) variables.

plot(USArrests$UrbanPop,USArrests$Assault)
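
A visual impression can be backed up with a number. The sketch below redraws the scatterplot with axis labels and adds the Pearson correlation; the label text and the names arrests and r are illustrative.

```r
arrests <- datasets::USArrests

# Formula interface: y ~ x, with readable axis labels
plot(Assault ~ UrbanPop, data = arrests,
     xlab = "Urban population (%)",
     ylab = "Assault arrests per 100,000",
     pch = 16)

# Pearson correlation summarizes the strength of the linear relationship
r <- cor(arrests$UrbanPop, arrests$Assault)
r
```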

ggplot2

Example 1

ggplot2 is an alternative, more unified framework (the Grammar of Graphics) for creating plots and graphics in R.

ggplot(data=USArrests,
       mapping=aes(x=UrbanPop,y=Assault)) + 
  geom_point()

Example 2

ggplot(data=USArrests,
       mapping=aes(x=UrbanPop_Cat,y = Assault, fill=UrbanPop_Cat)) + 
  geom_boxplot()

Basic ingredients of a ggplot

Data
- Data frame containing all the raw observations
Aesthetic mapping
- An aesthetic is a visual property of the plot, such as x-axis, y-axis, color, fill, size, etc.
- Use aes to specify which variable is mapped to which aesthetic
- Example: height -> x-axis; weight -> y-axis; gender -> fill; name -> text
Geoms
- The type of layer to draw on the plot, e.g. points, bars, or boxplots
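
The three ingredients can be kept apart in code. In this sketch the data and aesthetic mapping are stored in an object (the name p is illustrative) and different geoms are added afterwards.

```r
library(ggplot2)

# Data + aesthetic mapping, stored without any geom yet
p <- ggplot(data = datasets::USArrests,
            mapping = aes(x = UrbanPop, y = Assault))

# Same data and mapping, different geoms
p + geom_point()
p + geom_point() + geom_smooth(method = "lm")
```

Printing p on its own gives an empty panel with axes: the mapping is known, but no layer has been drawn yet.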

Example 1 explained

UrbanPop maps to the x-axis
Assault maps to the y-axis
geom_point draws scatterplot-type points for each observation

ggplot(data=USArrests,
       mapping=aes(x=UrbanPop,y=Assault)) + 
  geom_point()

Example 2 explained

UrbanPop_Cat maps to the x-axis AND fill (this kind of redundant mapping should be avoided!)
Assault maps to the y-axis
geom_boxplot draws boxplots
Try: change geom_boxplot to geom_violin

ggplot(data=USArrests,
       mapping=aes(
        x=UrbanPop_Cat,
        y = Assault,
        fill=UrbanPop_Cat
       )) + 
  geom_boxplot()

geom_histogram

Recall: a histogram is univariate; the y-axis shows the frequency (count), which ggplot computes for us, so we don’t need a y = mapping.

ggplot(data = USArrests,  aes(x = Murder)) + 
  geom_histogram()

Arguments can be added inside the geom

ggplot(data = USArrests,  aes(x = Murder)) + 
  geom_histogram(binwidth=5,
                 fill="orange",
                 color="blue")

Note the differences in naming conventions:
- base R col (bar fill) corresponds to ggplot fill
- base R border (bar outline) corresponds to ggplot color
- in particular, ggplot color sets the outline, whereas base R col sets the fill
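
The naming difference shows up clearly when the same histogram is drawn both ways. A minimal sketch; arrests and g are illustrative names.

```r
library(ggplot2)
arrests <- datasets::USArrests

# Base R: col fills the bars, border outlines them
hist(arrests$Murder, col = "orange", border = "blue")

# ggplot2: fill fills the bars, color outlines them
g <- ggplot(arrests, aes(x = Murder)) +
  geom_histogram(binwidth = 5, fill = "orange", color = "blue")
g
```

The two plots use different bin boundaries by default, so the bar heights need not match exactly; only the color arguments are being compared here.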

Add another dimension (i.e., map another variable to another aesthetic)

ggplot(data = USArrests, aes(x = Murder, fill=UrbanPop_Cat)) + 
  geom_histogram(binwidth=5, color="blue")

geom_bar

The bar graph shows that each category has roughly the same number of states, as expected, since ntile() splits the data into groups of near-equal size.

ggplot(data=USArrests, 
       mapping=aes(x=UrbanPop_Cat)) +
  geom_bar()
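
geom_bar() counts the rows for us. To make that step visible, the sketch below counts by hand and plots the result with geom_col(). It rebuilds the categories on a fresh copy; arrests and cat_counts are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Rebuild the three urban-population categories on a fresh copy
arrests <- datasets::USArrests |>
  mutate(UrbanPop_Cat = factor(ntile(UrbanPop, 3),
                               levels = 1:3,
                               labels = c("Low", "Medium", "High"),
                               ordered = TRUE))

# Count the states per category by hand
cat_counts <- arrests |> count(UrbanPop_Cat)
cat_counts

# geom_col() plots pre-computed heights; here they are the same counts
# that geom_bar() would compute
ggplot(cat_counts, aes(x = UrbanPop_Cat, y = n)) +
  geom_col()
```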

geom_text

ggplot(data=USArrests,
       aes(x=Murder,y=Assault,label=State)) + 
  geom_text()

Try: change geom_text to geom_label
Try: add geom_point

Customize ggplot with +

ggplot offers a very convenient syntax (+) to add elements and layers

ggplot(data=USArrests, 
       mapping=aes(x=UrbanPop_Cat)) +
  geom_bar() + 
  theme_bw() +
  ggtitle('Bar graph') +
  labs(x='Urban population percentage')

Example: change gradient color

ggplot(data=USArrests,
       aes(x=Murder,y=Assault,color=UrbanPop)) + 
  geom_point()+
  theme_bw() +
  ggtitle('Violent Crime Rates') +
  labs(x='Murder',y="Assault",color='Urban Pop Percentage') +
  scale_color_gradient(low="lightgreen",high="black")

Example: change categorical color

ggplot(data=USArrests, 
       aes(x=Murder,fill=UrbanPop_Cat)) +
  geom_histogram() +
  scale_fill_manual(values=c("Low"='green',"Medium"='purple', "High"="darkgrey")) +
  theme_minimal()

There’s more!

There are many other geoms and customizations available!
Each geom has its own set of aesthetics to which variables can be mapped.
Bring up the help page to check what aesthetics are needed.
Many references can be found online - they all share the same, unified structure that you can now understand.

Data manipulation: tidyverse

The Pipe operator

tidyverse packages: https://www.tidyverse.org/packages/
Powerful and popular for working with complicated datasets, typically arranged in a tidy format (one row per observation, one column per variable)
Importantly, the pipe operator offers a new way of calling a function
These three lines of code are equivalent:

summary(USArrests)
USArrests %>% summary   # magrittr / dplyr pipe
USArrests |> summary()  # base R pipe (R >= 4.1); the right-hand side must be a function call
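
The equivalence can be checked rather than taken on faith. This sketch assumes R 4.1 or later for the base pipe; the three result names are illustrative.

```r
library(dplyr)  # also makes the magrittr pipe %>% available

arrests <- datasets::USArrests

res_call     <- summary(arrests)
res_magrittr <- arrests %>% summary    # magrittr pipe: parentheses are optional
res_base     <- arrests |> summary()   # base R pipe: parentheses are required

# Both comparisons should return TRUE
identical(res_call, res_magrittr)
identical(res_call, res_base)
```

In both cases the object on the left becomes the first argument of the function on the right, so x |> f(y) is read as f(x, y).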

Functions in tidyverse are typically used with the pipe operator, but this is not required. For instance, these two lines are equivalent:

USArrests |> filter(UrbanPop > 80) |> select(Murder, UrbanPop) 
select(filter(USArrests,UrbanPop > 80), Murder, UrbanPop)

Useful tidyverse functions

# filter: subset data according to some condition
USArrests |> 
  filter(UrbanPop_Cat=='High', Murder > 10)

# slice: subset specific rows
# select: subset specific columns
USArrests |> 
  slice(1:5) |>
  select(Murder,Assault)

# arrange: sort by a variable
USArrests |> 
  arrange(desc(Murder)) |>
  slice(1:5)

# summarise: calculate summary statistics
USArrests |> 
  summarise(mean(Murder),median(Murder),count=length(Murder))


# mutate: create new variables
USArrests |> 
  mutate(Violent = Murder + Assault + Rape) |>
  arrange(desc(Violent)) |>
  select(Violent)

# group_by: look at subset groups of data
USArrests |> 
  group_by(UrbanPop_Cat) |>
  summarise(mean(Murder), median(Murder))
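
Unnamed summaries produce column names like `mean(Murder)`, which are awkward to reuse. A sketch of the same grouped summary with explicit names and a group size; arrests, by_cat, and the column names are illustrative.

```r
library(dplyr)

# Rebuild the categories on a fresh copy
arrests <- datasets::USArrests |>
  mutate(UrbanPop_Cat = factor(ntile(UrbanPop, 3),
                               levels = 1:3,
                               labels = c("Low", "Medium", "High"),
                               ordered = TRUE))

# name = expression gives each summary column a usable name;
# n() counts the rows in the current group
by_cat <- arrests |>
  group_by(UrbanPop_Cat) |>
  summarise(n_states      = n(),
            mean_murder   = mean(Murder),
            median_murder = median(Murder))
by_cat
```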

Integration with ggplot

The pipe operator works very well with ggplot syntax

USArrests |> 
  mutate(Urban = ifelse(UrbanPop > 80, "Yes", "No")) |>
  ggplot(aes(x=Murder, y=Assault, color=Urban)) +
  geom_point() +
  theme_bw()

Combine data: map example

Map

There are a few different maps included in the maps package

# see ?map_data
mymap <- map_data(map="state")

What exactly is mymap? Try some basic visualizations…

plot(mymap$long,mymap$lat)

ggplot(data=mymap,aes(x=long,y=lat,group=group)) +
  geom_polygon() +
  coord_fixed(1.4) +
  theme_void()
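
Besides plotting, the structure of mymap can be inspected directly. A minimal sketch, assuming the maps package is installed; the summary column names are illustrative.

```r
library(dplyr)
library(ggplot2)  # map_data() lives here; it needs the maps package installed

mymap <- map_data(map = "state")

# One row per boundary point, not one row per state
str(mymap)

# Number of points, distinct regions, and distinct polygons (group)
mymap |>
  summarise(n_points   = n(),
            n_regions  = n_distinct(region),
            n_polygons = n_distinct(group))

# order gives the sequence in which the points of a polygon are connected
mymap |> filter(region == "alabama") |> slice_head(n = 3)
```

A region with islands or separate land areas is drawn as several polygons, which is why group, not region, is used in the group aesthetic.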

Combine data

We want to combine the following datasets.
The “State” column in USArrests should match the “region” column in mymap; the state names should have consistent spelling and capitalization

USArrests |> slice_head(n=3)

        Murder Assault UrbanPop Rape   State UrbanPop_tile UrbanPop_Cat
Alabama   13.2     236       58 21.2 Alabama             1          Low
Alaska    10.0     263       48 44.5  Alaska             1          Low
Arizona    8.1     294       80 31.0 Arizona             3         High

mymap |> slice_head(n=3)

       long      lat group order  region subregion
1 -87.46201 30.38968     1     1 alabama      <NA>
2 -87.48493 30.37249     1     2 alabama      <NA>
3 -87.52503 30.37249     1     3 alabama      <NA>

# convert state names to lower case
USArrests$State <- tolower(USArrests$State)

# combine data with left_join
mymap_comb <- left_join(mymap, USArrests,
                        by = c('region' = 'State'))

mymap_comb |> slice_head(n=3)

       long      lat group order  region subregion Murder Assault UrbanPop Rape
1 -87.46201 30.38968     1     1 alabama      <NA>   13.2     236       58 21.2
2 -87.48493 30.37249     1     2 alabama      <NA>   13.2     236       58 21.2
3 -87.52503 30.37249     1     3 alabama      <NA>   13.2     236       58 21.2
  UrbanPop_tile UrbanPop_Cat
1             1          Low
2             1          Low
3             1          Low
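
left_join() keeps every row of mymap. Map regions without a match get NA in the new columns, and states that are not on the map are silently dropped. It is worth checking both directions before plotting. A sketch on fresh copies; arrests, map_only, and arrests_only are illustrative names.

```r
library(dplyr)
library(ggplot2)  # for map_data(); needs the maps package installed

mymap <- map_data(map = "state")
arrests <- datasets::USArrests
arrests$State <- tolower(row.names(arrests))

# Regions on the map with no matching row in the arrest data
map_only <- setdiff(unique(mymap$region), arrests$State)
map_only

# States in the arrest data that the map does not draw
arrests_only <- setdiff(arrests$State, unique(mymap$region))
arrests_only
```

The "state" map covers the contiguous United States, so Alaska and Hawaii are expected in arrests_only, while the District of Columbia appears on the map without arrest data.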

Visualization!

mymap_comb |>
  ggplot(aes(x=long,y=lat,group=group,
             fill=Murder)) +
  geom_polygon(color='darkblue') +
  coord_fixed(1.4) +
  theme_void() +
  labs(fill='Murder rate') +
  scale_fill_gradient(low="green", high="red")

Different ways to join data

Run ?left_join to read about the different ways to join data.
Discussion: what will the following code return? You can run the code to find out.

df.student

  name exam project
1    A   80       1
2    B   91       2
3    C   85       1
4    D   78      NA

df.project

  projectID    description
1         1  visualization
2         2 classification
3         3    text mining
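
To run the joins yourself, the two data frames need to exist first. The sketch below rebuilds them from the printed output above; the column types (integer IDs, numeric exam scores) are an assumption, since only the printed values are shown.

```r
# Rebuilt from the printed output; column types are assumed
df.student <- data.frame(
  name    = c("A", "B", "C", "D"),
  exam    = c(80, 91, 85, 78),
  project = c(1L, 2L, 1L, NA)   # student D has no project
)

df.project <- data.frame(
  projectID   = 1:3,
  description = c("visualization", "classification", "text mining")
)

df.student
df.project
```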

inner_join(df.student, df.project, by=c('project' = 'projectID'))
left_join(df.student, df.project, by=c('project' = 'projectID'))
left_join(df.project, df.student, by=c('projectID' = 'project'))
right_join(df.student, df.project, by=c('project' = 'projectID'))
full_join(df.student, df.project, by=c('project' = 'projectID'))