Week 1: Introduction

Getting started

Acquaintances

About me
About you
About the course

Course goals

Attain basic programming fluency in R
Use R to:
- Manipulate and summarize data
- Apply statistical techniques to learn from data
Strengthen your skills at:
- Drawing data-based insights into real-world problems
- Communicating findings clearly
- Assessing strength of statistical arguments

About R

Prehistory: the S language
- S was developed by John M. Chambers and colleagues at Bell Labs, beginning in the mid-1970s
- Focus was on data analysis and graphics
- S version 3 was released alongside the influential book Statistical Models in S (1992) by Chambers and Trevor Hastie (a highly influential statistician in the area of statistical and machine learning)
- S was proprietary software distributed by Bell, later commercialized as S-PLUS
Development of R
- R is a free open-source implementation of S
- It was written by Ross Ihaka and Robert Gentleman at the University of Auckland

Languages for modern data science

Contemporary landscape

Practicing data scientists today work in a highly multilingual environment: Python, R, Julia, etc.
Relative strengths of R
- Popularity in the field of statistical research (in some more applied sub-fields, like AI and Machine Learning, Python is ascendant)
- Open-source extensions (“libraries”) include leading frameworks for data manipulation and visualization, and also for cutting edge new methodologies
Relative weaknesses of R
- Speed and performance
- Weak native object orientation

Glimpse of R in action

Example: R package download counts

Getting the data from `cranlogs`

We will work with data from various sources this semester
In this example, I will look at data related to download counts for a few R packages from CRAN (the most popular repository for open-source extensions in R)
I do not expect you to understand all this code
The goal is just to highlight the efficiency with which R can handle data-analysis tasks

Getting some data about package download counts

library(cranlogs)

# list a few popular R packages
packages <- c("ggplot2", "dplyr", "cranlogs", "DT", "OncoBayes2")

# use the cran_downloads() function from the cranlogs package to
# get daily download counts from 2025-01-01 through 2025-08-22
download_counts <- cranlogs::cran_downloads(
  packages = packages,
  from = "2025-01-01", to = "2025-08-22"
)
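
For anyone following along without network access, a small stand-in data frame can help make the shape of the result concrete. The sketch below assumes the result has the three columns shown in the printed tables later in these slides (`date`, `count`, `package`); the name `toy_counts` is hypothetical, and the values are copied from the first printed rows rather than downloaded.

```r
# Hypothetical stand-in for `download_counts`: same three column names,
# values copied from the first rows printed later in these slides.
toy_counts <- data.frame(
  date    = as.Date(c("2025-01-01", "2025-01-01", "2025-01-01",
                      "2025-01-01", "2025-01-01", "2025-01-02")),
  count   = c(11099, 45, 34, 27657, 25618, 13421),
  package = c("DT", "OncoBayes2", "cranlogs", "dplyr", "ggplot2", "DT"),
  stringsAsFactors = FALSE
)

str(toy_counts)            # column types of the toy data
table(toy_counts$package)  # number of rows (days) recorded per package
range(toy_counts$date)     # first and last date covered
```

The data are in "long" format: one row per package per day, which is the layout the plotting and summarizing steps in this example work with.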

Example: R package download counts

Examine the data

Use the DT interface to DataTables.js for a paginated printout of the raw data

library(DT)
DT::datatable(download_counts, options = list(pageLength = 8))

Example: R package download counts

Visualize the data

R has extremely strong and user-friendly capabilities for graphics
Numerous popular packages, such as ggplot2, extend R’s usefulness in this area even further

Use ggplot2 to visualize the data on package-download counts

library(ggplot2)

ggplot(download_counts,
       aes(x = date, y = count, group = package, color = package)) +
  geom_path() +
  labs(x = "Date (2025)",
       y = "Number of daily package downloads")

Example: R package download counts

Manipulate and transform the data

R likewise has excellent facilities for storing and manipulating tabular data

Use dplyr to consolidate the daily counts into weekly counts

library(dplyr)

# sum counts for each package across each week of 2025
weekly_counts <- download_counts |>
  mutate(week = as.Date("2025-01-01") +
           7 * (as.integer(date - as.Date("2025-01-01")) %/% 7)) |>
  group_by(package, week) |>
  summarize(count = sum(count), .groups = "drop")
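
The weekly binning idea can be checked on a handful of dates using base R alone. This is a minimal sketch, assuming weeks are counted in 7-day blocks starting on 1 January 2025; the names `dates`, `origin`, `days_elapsed`, and `week_start` are illustrative and do not appear in the code above.

```r
# Toy dates: 1 to 15 January 2025 (illustrative only)
dates <- seq(as.Date("2025-01-01"), as.Date("2025-01-15"), by = "day")
origin <- as.Date("2025-01-01")

# Whole days elapsed since the origin, then integer-divide by 7
# to get the index of the 7-day block each date falls in
days_elapsed <- as.integer(dates - origin)
week_start <- origin + 7 * (days_elapsed %/% 7)

# Number of dates that fall into each weekly bin
table(format(week_start))
```

Working with whole-day differences between `Date` objects keeps the calculation independent of the computer's time zone.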

Print the first several rows of the original data

knitr::kable(head(download_counts))

date	count	package
2025-01-01	11099	DT
2025-01-01	45	OncoBayes2
2025-01-01	34	cranlogs
2025-01-01	27657	dplyr
2025-01-01	25618	ggplot2
2025-01-02	13421	DT

Print the first several rows of the transformed data

knitr::kable(head(weekly_counts))

package	week	count
DT	2025-01-01	81713
OncoBayes2	2025-01-01	113
cranlogs	2025-01-01	258
dplyr	2025-01-01	270181
ggplot2	2025-01-01	256966
DT	2025-01-08	78110

Example: R package download counts

Visualize again with alternative time scales

Use ggplot2 to plot the download frequencies on both timescales

bind_rows(mutate(download_counts, time_unit = "daily", time = date),
          mutate(weekly_counts, time_unit = "weekly", time = week)) |>
  ggplot(aes(x = time, y = count, group = package, color = package)) +
  geom_path() +
  labs(x = "Date (2025)",
       y = "Number of package downloads per unit time") +
  facet_wrap(~ time_unit, labeller = "label_both", scales = "free")

Example: R package download counts

Leveraging more packages to simplify tasks

Some extensions to R are highly specialized for specific purposes (e.g. data-analysis tasks, types of data, types of models, etc.)
Many are community-supported, but some have more stable backing
Base R evolves slowly, while its extensions grow and adapt quickly
The right tool can make an otherwise difficult task very easy

Use specialized packages for time-indexed data, tsibble and lubridate, to manipulate ggplot2 download data over a longer timescale

library(tsibble)
library(lubridate)

ggplot_downloads <- cranlogs::cran_downloads(
  packages = "ggplot2",
  from = "2015-01-01", to = "2025-08-22"
) |>
  tsibble(index = "date") |>
  mutate(year = year(date),
         month = factor(month(date, label = TRUE, abbr = TRUE),
                        ordered = FALSE)) |>
  index_by(time = ~ yearmonth(.)) |>
  summarize(count = sum(count), month = first(month), year = first(year))
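
To see what the monthly roll-up is doing, here is a rough base-R analogue on made-up data. It is a sketch of the same idea (label each day with its month, then sum within each label), not a replacement for the tsibble code; `daily` and `monthly` are hypothetical names and the constant count of 10 is invented for illustration.

```r
# Toy daily data: a constant 10 downloads per day for Jan-Feb 2025
daily <- data.frame(
  date  = seq(as.Date("2025-01-01"), as.Date("2025-02-28"), by = "day"),
  count = 10
)

# Label each day with its year-month, then sum counts within each label
daily$month <- format(daily$date, "%Y-%m")
monthly <- aggregate(count ~ month, data = daily, FUN = sum)
monthly
```

The specialized packages add value beyond this: the monthly index stays a proper time object rather than a character label, which is what makes the plotting step that follows so short.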

Example: R package download counts

Leveraging more packages to simplify tasks

Code for a basic plot is made extremely simple by using these specialist libraries

ggplot_plot <- fabletools::autoplot(ggplot_downloads, .vars = count)
ggplot_plot

Example: R package download counts

Fitting a linear regression model

R is designed for and heavily used by statisticians
It has strong native support for classical statistical techniques

# Simple linear regression of monthly download count on time (in months)
fit <- lm(count ~ time, data = ggplot_downloads)
summary(fit)


Call:
lm(formula = count ~ time, data = ggplot_downloads)

Residuals:
     Min       1Q   Median       3Q      Max 
-1510522  -302044   -73162   348538  1643740 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.108e+07  7.803e+05  -14.20   <2e-16 ***
time         6.738e+02  4.240e+01   15.89   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 539600 on 126 degrees of freedom
Multiple R-squared:  0.6671,    Adjusted R-squared:  0.6645 
F-statistic: 252.5 on 1 and 126 DF,  p-value: < 2.2e-16
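
The pieces of a printout like this can also be pulled out of a fitted model programmatically. The sketch below uses a small made-up data set (not the download data) so that it runs anywhere; `x`, `y`, and `toy_fit` are illustrative names.

```r
# Toy data following y = 3 + 2 * x plus a small fixed perturbation
# (no randomness, so the results are reproducible)
x <- 1:10
y <- 3 + 2 * x + c(0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1)
toy_fit <- lm(y ~ x)

coef(toy_fit)                 # estimated intercept and slope
summary(toy_fit)$r.squared    # proportion of variance explained
head(residuals(toy_fit))      # observed minus fitted values
```

Looking at the residuals, and not only at the coefficient table, is one way to start judging whether a straight line is a sensible description of the data.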

Summary

R is awesome for people who love data
Base R was developed with statisticians in mind
Statistics has evolved very rapidly as a field in recent decades as computational power has advanced
Open-source extensions to R (“packages”) can dramatically simplify data-analysis tasks

Example: R package download counts

Visualizing the fit

Add the fitted line to the previous plot

# Overlay the fitted regression line (in red) on the monthly download plot
ggplot_plot + 
  geom_path(data = mutate(ggplot_downloads, pred = predict(fit)),
            aes(y = pred), color = "red")

Fireside chat

About the last example

Are you familiar with simple linear regression?
What (if anything) concerns you about the use of this technique in the last example?

Some more questions for you

Installing things

Software to install

R (free and open source)
RStudio IDE (free desktop edition, developed by the company Posit)
Quarto (free, also developed by Posit): a “literate programming” system

Step-by-step

Go to https://cran.rstudio.com and download the latest version of R
Go to RStudio Desktop download page and download RStudio Desktop
Go to Quarto Getting Started and download Quarto

One more thing

For labs, we will use Quarto to render code and outputs into PDF documents.
This has one more prerequisite:
1. In RStudio, in the bottom-left pane, switch from the “Console” to the “Terminal” tab.
2. Type in and run this command: quarto install tinytex
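
If the command is not recognized, one possible first check is whether a `quarto` executable is visible on the system search path. The snippet below is a minimal sketch using base R only; `quarto_path` and `quarto_found` are illustrative names. A negative result does not necessarily mean Quarto is missing, since RStudio may use a bundled copy that is not on the path.

```r
# Look up the `quarto` executable on the system PATH (base R only)
quarto_path <- unname(Sys.which("quarto"))
quarto_found <- nzchar(quarto_path)

if (quarto_found) {
  message("Quarto found at: ", quarto_path)
} else {
  message("Quarto not found on the PATH; check the installation step.")
}
```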

Lab 1

About labs

Labs will be in-class hands-on coding and/or data-analysis exercises
Most labs will entail editing and rendering a skeleton Quarto script I will provide to you
Instructions for the lab will be included in the script for the week

Lab 1

For Lab 1 you’ll start from this pdf: lab_1.pdf
