Week 5: Data manipulation & visualization
2025-10-28
The following packages are required for today’s lecture. Please install them first if needed.
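The package list itself is not preserved in these notes. A minimal sketch, assuming the lecture relies on tidyverse (for dplyr and ggplot2) and maps (for the state map data that appears later); the package vector and the CRAN mirror below are assumptions, so adjust them as needed:

```r
# Assumed package list (not taken from the original notes).
pkgs <- c("tidyverse", "maps")

# Install only the packages that are missing.
to_install <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}

# Load them for the session.
library(tidyverse)
library(maps)
```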
To start, we’ll use USArrests, a very simple dataset that ships with base R.
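A quick way to get oriented is to inspect the data before changing anything. This sketch only uses base R functions on the built-in USArrests data frame:

```r
# USArrests is part of the built-in datasets package, so nothing needs to be loaded.
first_rows <- head(USArrests, 3)  # first three states
first_rows

dim(USArrests)    # 50 rows (one per state) and 4 columns
names(USArrests)  # Murder, Assault, UrbanPop, Rape
str(USArrests)    # column types at a glance
```

Note that the state names are stored as row names rather than as a column.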
Create a separate variable for state names (State), then create a categorical (ordinal) variable (UrbanPop_Cat) based on urban population (UrbanPop). More on this syntax later.
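The code for this step is not preserved here, so the following is one possible way to do it rather than the original. It assumes the urban population is split into three equal-sized groups with dplyr's ntile(); the "Medium" label is also an assumption, since only "Low" and "High" appear in the output shown later.

```r
library(dplyr)

USArrests <- USArrests |>
  mutate(
    # State names live in the row names; copy them into a regular column.
    State = rownames(USArrests),
    # Assumed approach: rank states by UrbanPop and cut into 3 groups.
    UrbanPop_tile = ntile(UrbanPop, 3),
    # Ordered factor so that Low < Medium < High.
    UrbanPop_Cat = factor(UrbanPop_tile, levels = 1:3,
                          labels = c("Low", "Medium", "High"),
                          ordered = TRUE)
  )

head(USArrests, 3)
```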
ggplot2 offers a very convenient syntax: the + operator adds elements and layers to a plot.
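As an illustration of the + syntax, here is a small sketch that builds a scatter plot layer by layer. The choice of variables and the theme are arbitrary and not taken from the lecture.

```r
library(ggplot2)

p <- ggplot(USArrests, aes(x = UrbanPop, y = Murder)) +  # data and aesthetic mapping
  geom_point() +                                         # layer 1: one point per state
  geom_smooth(method = "lm", se = FALSE) +               # layer 2: fitted straight line
  labs(x = "Urban population (%)",
       y = "Murder arrests per 100,000",
       title = "USArrests") +                            # axis labels and title
  theme_minimal()                                        # overall look

p  # printing the object draws the plot
```

Each + adds one piece, so a layer can be dropped or swapped without touching the rest of the plot.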
tidyverse packages: https://www.tidyverse.org/packages/
Powerful and popular for working with complicated datasets, typically arranged in a tidy format (each variable a column, each observation a row)
Importantly, the pipe operator (|>) offers a new way of calling a function: x |> f(y) is equivalent to f(x, y)
These three lines of code are equivalent
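The three lines themselves are missing from these notes. A plausible reconstruction, assuming they compare an ordinary call, the native pipe, and the magrittr pipe (the use of head() is only for illustration):

```r
library(dplyr)  # makes the magrittr pipe %>% available

r1 <- head(USArrests, 3)      # ordinary function call
r2 <- USArrests |> head(3)    # native pipe (R 4.1 or later)
r3 <- USArrests %>% head(3)   # magrittr pipe

identical(r1, r2) && identical(r1, r3)  # TRUE
```

In each case the left-hand side becomes the first argument of the function on the right.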
# filter: subset data according to some condition
USArrests |>
  filter(UrbanPop_Cat == 'High', Murder > 10)

# slice: subset specific rows
# select: subset specific columns
USArrests |>
  slice(1:5) |>
  select(Murder, Assault)

# arrange: sort by a variable
USArrests |>
  arrange(desc(Murder)) |>
  slice(1:5)

# summarise: calculate summary statistics
USArrests |>
  summarise(mean(Murder), median(Murder), count = length(Murder))

# mutate: create new variables
USArrests |>
  mutate(Violent = Murder + Assault + Rape) |>
  arrange(desc(Violent)) |>
  select(Violent)

# group_by: operate on subgroups of the data
USArrests |>
  group_by(UrbanPop_Cat) |>
  summarise(mean(Murder), median(Murder))

        Murder Assault UrbanPop Rape   State UrbanPop_tile UrbanPop_Cat
Alabama   13.2     236       58 21.2 Alabama             1          Low
Alaska    10.0     263       48 44.5  Alaska             1          Low
Arizona    8.1     294       80 31.0 Arizona             3         High

       long      lat group order  region subregion
1 -87.46201 30.38968     1     1 alabama      <NA>
2 -87.48493 30.37249     1     2 alabama      <NA>
3 -87.52503 30.37249     1     3 alabama      <NA>

       long      lat group order  region subregion Murder Assault UrbanPop Rape
1 -87.46201 30.38968     1     1 alabama      <NA>   13.2     236       58 21.2
2 -87.48493 30.37249     1     2 alabama      <NA>   13.2     236       58 21.2
3 -87.52503 30.37249     1     3 alabama      <NA>   13.2     236       58 21.2
  UrbanPop_tile UrbanPop_Cat
1             1          Low
2             1          Low
3             1          Low
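The code that produced the two tables above is not preserved. The columns (long, lat, group, order, region, subregion) match what ggplot2's map_data("state") returns, so the following sketch assumes the map outlines were joined to USArrests on lower-cased state names. The object names states_map, arrests, map_arrests and p_map are hypothetical, and the maps package must be installed for map_data() to work.

```r
library(dplyr)
library(ggplot2)

# State outlines: one row per polygon vertex, region names in lower case.
states_map <- map_data("state")

# Lower-case state names so they can be matched against `region`.
arrests <- USArrests |>
  mutate(State = tolower(rownames(USArrests)))

# Keep every map vertex and attach the arrest statistics for its state.
map_arrests <- left_join(states_map, arrests, by = c("region" = "State"))
head(map_arrests, 3)

# Assumed use of the merged data: a choropleth of murder arrest rates.
p_map <- ggplot(map_arrests,
                aes(x = long, y = lat, group = group, fill = Murder)) +
  geom_polygon(color = "white") +
  coord_quickmap()
p_map
```

Regions in the map data with no counterpart in USArrests (for example the District of Columbia) end up with NA values after the join.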
  name exam project
1    A   80       1
2    B   91       2
3    C   85       1
4    D   78      NA

  projectID    description
1         1  visualization
2         2 classification
3         3    text mining

inner_join(df.student, df.project, by = c('project' = 'projectID'))
left_join(df.student, df.project, by = c('project' = 'projectID'))
left_join(df.project, df.student, by = c('projectID' = 'project'))
right_join(df.student, df.project, by = c('project' = 'projectID'))
full_join(df.student, df.project, by = c('project' = 'projectID'))
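To see how the joins differ, it helps to rebuild the two small tables and compare row counts. The data frames below are reconstructed from the printed output above; the column types (numeric IDs, character names) and the helper name key are assumptions.

```r
library(dplyr)

# Reconstructed from the printed tables; student D has no project.
df.student <- data.frame(name = c("A", "B", "C", "D"),
                         exam = c(80, 91, 85, 78),
                         project = c(1, 2, 1, NA))
df.project <- data.frame(projectID = c(1, 2, 3),
                         description = c("visualization", "classification",
                                         "text mining"))

key <- c("project" = "projectID")  # left column name = right column name

n_inner <- nrow(inner_join(df.student, df.project, by = key))  # 3: A, B, C
n_left  <- nrow(left_join(df.student, df.project, by = key))   # 4: all students, D gets NA
n_left2 <- nrow(left_join(df.project, df.student,
                          by = c("projectID" = "project")))    # 4: project 1 twice, project 3 gets NA
n_right <- nrow(right_join(df.student, df.project, by = key))  # 4: all projects kept
n_full  <- nrow(full_join(df.student, df.project, by = key))   # 5: everything from both

c(inner = n_inner, left = n_left, left_swapped = n_left2,
  right = n_right, full = n_full)
```

An inner join keeps only matched rows, left and right joins keep every row of one table, and a full join keeps every row of both.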