Week 11: Writing Functions & Organizing R Code

Why write functions?

  • Avoid repetition: Don’t Repeat Yourself (DRY principle)
  • Improve readability: Give meaningful names to operations
  • Reduce errors: Fix bugs in one place, not many
  • Make code testable: Easier to verify correctness
  • Enable reuse: Use the same function across projects

Function basics

Anatomy of a function

  • Name: how you’ll call the function
  • Arguments: inputs to the function
  • Body: the code that runs
  • Return value: what the function outputs
function_name <- function(arg1, arg2, ...) {
  # Function body: code that does something
  result <- arg1 + arg2
  
  # Return value (implicit or explicit)
  return(result)
}

Simple example

# Function to calculate the mean of squared deviations
mean_squared_deviation <- function(x) {
  mean_x <- mean(x)
  squared_devs <- (x - mean_x)^2
  result <- mean(squared_devs)
  return(result)
}

# Test it
values <- c(2, 4, 6, 8, 10)
mean_squared_deviation(values)
[1] 8
# Compare to variance
var(values) * (length(values) - 1) / length(values)
[1] 8

Return values

  • Functions return the last expression evaluated (implicit return)
  • Or use return() for explicit return (clearer for complex functions)
# Implicit return
add_implicit <- function(a, b) {
  a + b
}

# Explicit return
add_explicit <- function(a, b) {
  result <- a + b
  return(result)
}

add_implicit(3, 5)
[1] 8
add_explicit(3, 5)
[1] 8

Default arguments

  • Provide default values for arguments
  • Makes functions more flexible
# Function with default argument
greet <- function(name, greeting = "Hello") {
  paste(greeting, name)
}

greet("Alice")
[1] "Hello Alice"
greet("Bob", greeting = "Hi")
[1] "Hi Bob"

Example: Standardizing variables

# Z-score standardization
standardize <- function(x, na.rm = TRUE) {
  mean_x <- mean(x, na.rm = na.rm)
  sd_x <- sd(x, na.rm = na.rm)
  z <- (x - mean_x) / sd_x
  return(z)
}

# Test it
test_data <- c(10, 20, 30, 40, 50, NA)
standardize(test_data)
[1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111         NA
# The standardized values have mean 0 (up to floating-point error) and sd 1
mean(standardize(test_data), na.rm = TRUE)
[1] 8.881784e-17
sd(standardize(test_data), na.rm = TRUE)
[1] 1

The ellipsis (...) argument

  • The ... allows passing additional arguments to functions called within
  • Useful when you want to pass arguments through to another function
# Function that passes arguments to mean()
calculate_mean <- function(x, ...) {
  mean(x, ...)
}

# Now we can pass na.rm, trim, etc. to mean()
data_with_na <- c(1, 2, 3, NA, 5)
calculate_mean(data_with_na, na.rm = TRUE)
[1] 2.75
# Note: with only four non-missing values, trim = 0.1 trims
# floor(4 * 0.1) = 0 observations from each end, so the result is unchanged
calculate_mean(data_with_na, na.rm = TRUE, trim = 0.1)
[1] 2.75

Example: Wrapper function with ...

# Create a custom plotting function that passes extra args to plot()
my_scatter <- function(x, y, ...) {
  plot(x, y, pch = 19, col = "blue", ...)
}

my_scatter(1:10, (1:10) ^ 2,
           main = "My Plot", xlab = "X values")  # With labels

When to write a function?

The “three times” rule

  • If you copy-paste code three times, write a function
  • Even twice might be worth it!

Before: repetitive code

df$x1 <- (df$x1 - mean(df$x1)) / sd(df$x1)
df$x2 <- (df$x2 - mean(df$x2)) / sd(df$x2)
df$x3 <- (df$x3 - mean(df$x3)) / sd(df$x3)

After: with a function

standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

df$x1 <- standardize(df$x1)
df$x2 <- standardize(df$x2)
df$x3 <- standardize(df$x3)
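
Once the logic lives in a function, the remaining repetition can be removed too. A possible further step, as a sketch assuming df is a data frame with columns x1 through x3 and that dplyr is available:

# Standardize several columns in one call with dplyr::across()
library(dplyr)
df <- df |>
  mutate(across(c(x1, x2, x3), standardize))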

Function-oriented programming

What is function-oriented programming?

  • A style of organizing data analysis where every step is a function
  • Benefits:
    • Modularity: each function does one thing well
    • Testability: easy to verify each step works
    • Reproducibility: clear pipeline from raw data to results
    • Debugging: isolate problems quickly
    • Collaboration: others can understand and modify your work

Traditional “script” approach

Weaknesses, especially if the script is longer or more complex than this one:

  • Harder to reuse parts
  • Difficult to test individual steps
  • Not always clear what each section does
  • Variables clutter the global environment

# Load data
data <- read.csv("raw_data.csv")

# Clean data
data$x <- ifelse(data$x < 0, NA, data$x)
data$y <- log(data$y + 1)
data <- data[complete.cases(data), ]

# Analyze
model <- lm(y ~ x, data = data)
summary(model)

# Visualize
plot(data$x, data$y)
abline(model, col = "red")

Function-oriented approach

Define functions:

# Step 1: Load data
load_raw_data <- function(filepath) {
  read.csv(filepath)
}

# Step 2: Clean data
clean_data <- function(data) {
  data |>
    mutate(
      x = ifelse(x < 0, NA, x),
      y = log(y + 1)
    ) |>
    filter(complete.cases(data))
}

# Step 3: Fit model
fit_model <- function(data) {
  lm(y ~ x, data = data)
}

# Step 4: Create plot (requires ggplot2)
plot_results <- function(data, model) {
  # draw the line from the fitted model rather than refitting with geom_smooth()
  ggplot(data, aes(x = x, y = y)) +
    geom_point() +
    geom_abline(
      intercept = coef(model)[1],
      slope = coef(model)[2],
      color = "red"
    )
}

Execute pipeline:

# Load
data <- load_raw_data("raw_data.csv")

# Clean
data_clean <- clean_data(data)

# Analyze
model <- fit_model(data_clean)

# Visualize
plot_results(data_clean, model)

Benefits:

  • Clear, readable pipeline
  • Easy to test each step
  • Functions can be reused
  • Changes isolated to one place
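
To make "easy to test each step" concrete, a minimal sketch of checking clean_data() against a handmade input (test_df and the expected row count are illustrative; a real project might use the testthat package):

# Tiny input: one negative x that should be dropped, two valid rows
test_df <- data.frame(x = c(1, -2, 3), y = c(0, 1, 2))

cleaned <- clean_data(test_df)

# The negative x becomes NA in mutate() and is filtered out
stopifnot(nrow(cleaned) == 2)
stopifnot(all(cleaned$x >= 0))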

Principles of good functions

  1. Do one thing well: each function has a single, clear purpose
  2. Use descriptive names: calculate_mean_age() not f1()
  3. Keep them short: if a function is too long, break it into smaller functions
  4. Minimize side effects: don’t modify global variables
  5. Document your functions: explain what they do, what arguments they take, what they return
# Good: clear name, single purpose, documented
#' Calculate the coefficient of variation
#' 
#' @param x A numeric vector
#' @param na.rm Logical; should missing values be removed?
#' @return The coefficient of variation (sd/mean)
coefficient_of_variation <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}
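
Following the same "test it" pattern as the earlier examples (the values are illustrative):

# Test it
x <- c(10, 12, 9, 14, 11)
coefficient_of_variation(x)  # sd/mean, approximately 0.17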

Organizing your code

File structure for data analysis

project/
├── data/
│   ├── raw/              # Original, immutable data
│   └── processed/        # Cleaned data
├── R/
│   ├── 01_load.R        # Data loading functions
│   ├── 02_clean.R       # Data cleaning functions
│   ├── 03_analyze.R     # Analysis functions
│   └── 04_visualize.R   # Plotting functions
├── reports/
│   └── analysis.qmd      # Quarto document
├── output/
│   ├── figures/
│   └── tables/
└── main.R               # Main pipeline script

Example: main.R

# Main analysis pipeline
source("R/01_load.R")
source("R/02_clean.R")
source("R/03_analyze.R")
source("R/04_visualize.R")

# Load data
raw_data <- load_raw_data("data/raw/survey_data.csv")

# Clean data
clean_data <- clean_survey_data(raw_data)

# Analyze
summary_stats <- calculate_summary_statistics(clean_data)
model_results <- fit_regression_model(clean_data)

# Visualize
plot_distribution(clean_data, "age")
plot_regression_results(clean_data, model_results)

# Save results
save_results(summary_stats, "output/tables/summary_stats.csv")
save_plot(last_plot(), "output/figures/regression_plot.png")
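
For concreteness, a sketch of what one sourced file might contain. The function name matches the hypothetical clean_survey_data() used above, and age is the column plotted later in the pipeline:

# Hypothetical contents of R/02_clean.R
library(dplyr)

clean_survey_data <- function(raw_data) {
  raw_data |>
    mutate(age = as.numeric(age)) |>  # coerce age to numeric (non-numbers become NA)
    filter(!is.na(age))               # drop rows with missing age
}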

Good practices

Documentation with roxygen2 comments

What is roxygen2?

  • R package for documenting functions
  • Special comments starting with #'
  • Can generate help files automatically
  • Standard format used in R packages

Key tags:

  • @param: describe arguments
  • @return: describe output
  • @examples: show usage

Learn more: roxygen2.r-lib.org

#' Calculate standardized effect size (Cohen's d)
#'
#' @param group1 Numeric vector for group 1
#' @param group2 Numeric vector for group 2
#' @param na.rm Logical; should missing values be removed? Default TRUE
#'
#' @return Numeric value representing Cohen's d effect size
#' @export
#'
#' @examples
#' group_a <- c(10, 12, 14, 16, 18)
#' group_b <- c(15, 17, 19, 21, 23)
#' cohens_d(group_a, group_b)
cohens_d <- function(group1, group2, na.rm = TRUE) {
  mean_diff <- mean(group1, na.rm = na.rm) - mean(group2, na.rm = na.rm)
  pooled_sd <- sqrt((var(group1, na.rm = na.rm) + var(group2, na.rm = na.rm)) / 2)
  return(mean_diff / pooled_sd)
}
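
If functions like this live in an R package, the #' comments can be compiled into help files. A minimal sketch, assuming the devtools package is installed:

# Generate .Rd help files from the roxygen2 comments
# (run from the package root)
devtools::document()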

Common pitfalls to avoid

  1. Modifying global variables inside functions
# BAD: modifies global variable
process_data <- function() {
  data <<- data |> filter(!is.na(x))  # Don't do this!
}

# GOOD: returns a value
process_data <- function(data) {
  data |> filter(!is.na(x))
}

More pitfalls

  2. Overly complex functions
# BAD: does too many things
analyze_everything <- function(data) {
  # 100 lines of code doing loading, cleaning, analyzing, plotting...
}

# GOOD: break into smaller functions
load_data <- function(file) { ... }
clean_data <- function(data) { ... }
analyze_data <- function(data) { ... }
plot_results <- function(results) { ... }
  3. Poor naming
# BAD
f <- function(x, y) { ... }
calc <- function(d) { ... }

# GOOD
calculate_correlation <- function(variable1, variable2) { ... }
summarize_by_group <- function(data) { ... }

Summary

Key takeaways

  1. Write functions to avoid repetition and improve code quality
  2. Function-oriented programming organizes analyses into modular, testable steps
  3. Good functions:
    • Do one thing well
    • Have clear names
    • Include default arguments where appropriate
    • Handle errors gracefully (see the sketch after this list)
    • Are documented
  4. Organize projects with clear folder structure and pipeline scripts
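
The error-handling point deserves a concrete sketch. One simple approach is input validation with base R's stopifnot() and stop(); the function name safe_standardize() is hypothetical:

# Validate inputs and fail early with informative messages
safe_standardize <- function(x, na.rm = TRUE) {
  stopifnot(is.numeric(x))
  sd_x <- sd(x, na.rm = na.rm)
  if (is.na(sd_x) || sd_x == 0) {
    stop("`x` needs at least two distinct non-missing values")
  }
  (x - mean(x, na.rm = na.rm)) / sd_x
}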

Exam review questions

Live (anonymous) quiz

[QR code to join the quiz]

Lab 5

Today’s lab: Function-oriented workflow

  • Lab 5 provides hands-on practice with:
    • Refactoring a data analysis script into a well-organized function-oriented project