Week 4: Basic Statistical Inference

What is inference about?

  • Drawing conclusions about a large population on the basis of a (not necessarily large) sample
  • Challenge: observations based on samples are random
    • We would see something different if we took a new sample
  • If we see a trend in the sample, how do we distinguish if it is:
    • A signal: a real phenomenon that holds in the entire population
    • Just noise: attributable to random chance alone

Experimental philosophy

Tea tasting experiment

The claim

  • At a party, the topic of “milk first” versus “tea first” is being discussed
  • A partygoer claims she can distinguish the order in which they were poured based on taste alone

Tea tasting experiment

How to test this claim?

R.A. Fisher is widely regarded as the father of statistical inference
  • Another partygoer proposes to test this claim scientifically
  • R.A. Fisher reported on this experiment in his 1935 book The Design of Experiments

Tea tasting experiment

Formalizing the research question

  • In Fisher’s approach, we specify two competing hypotheses
Hypothesis Name Description Interpretation
\(H_0\) Null Hypothesis Lady cannot distinguish tea types Success rate = 0.5 (random guessing)
\(H_A\) Alternative Hypothesis Lady can distinguish tea types Success rate > 0.5 (genuine ability)

Tea tasting experiment

Collecting data (taking a sample)

  • They decide on the following experiment:
    • 8 cups of tea will be poured
      • 4 will have tea poured first
      • 4 will have milk poured first
  • These cups are presented to her in random order. Only the experimenter knows which was which.

Tea tasting experiment

Assessing the evidence

  • Fisher’s philosophy focuses on collecting data in order to assess the strength of evidence against \(H_0\) (i.e. in favor of \(H_A\))
  • i.e. how strong is the evidence that the tea taster has genuine ability

Tea tasting experiment

Assessing the evidence (to be continued…)

  • Once we see the data, e.g.
True: Tea First True: Milk First
Guess: Tea First 3 1
Guess: Milk First 1 3
  • Now we need to decide: are we convinced this is signal (i.e. can we rule out that it is noise)?

Statistical inference

Statistics vs. parameters

Statistic Parameter
Definition A numerical summary of observed data A characteristic of the true population-wide distribution
Symbol Roman letters (e.g., \(\bar{x}\), \(s\), \(\hat{p}\)) Greek letters (e.g., \(\mu\), \(\sigma\), \(p\))
Known/Unknown Known (we calculate it from our data) Unknown (what we want to learn about)
Variability Varies from sample to sample Fixed (but unknown) value
Example Sample mean height = 68.2 inches Population mean height = \(\mu\) (unknown)

Types of parameters

Type Symbol Description Example
Mean \(\mu\) Average value of a numeric variable in the population Population mean height, income, test scores
Proportion \(p\) Fraction of population with a certain characteristic Proportion of voters supporting a candidate
Frequency Distribution \(\pi_1, \pi_2, \ldots\) Proportions for each category of a categorical variable Distribution of blood types (A, B, AB, O)

Role of models

flowchart TD
A[Unknown Population Parameters<br/>μ, σ, p, etc.] --> B[Probability Model<br/>Normal, Binomial, etc.]
B --Random sampling mechanism--> D[Observed Sample Data<br/>x₁, x₂, ..., xₙ]
D --Statistical inference--> A

style A fill:#ffcccc
style D fill:#ccffcc

Types of inference

Confidence intervals

  • A confidence interval provides a range of plausible values for an unknown parameter
  • Constructed using sample data to estimate the parameter with a specified level of confidence
  • Example: “We are 95% confident that the true population mean is between 2.3 and 4.7”
  • The confidence level (e.g., 95%) tells us how often our method would capture the true parameter if we repeated the process many times

Hypothesis tests

  • A hypothesis test evaluates evidence against a null hypothesis (\(H_0\))
  • We calculate a p-value: the probability of observing data as extreme (or more extreme) than what we saw, assuming \(H_0\) is true
  • Small p-values (typically < 0.05) provide strong evidence against \(H_0\)
    • The significance level (e.g. \(\alpha = 0.05\)) is a pre-specified characteristic of the test: controls how convincing the evidence must be
  • Example: “The p-value is 0.032, so we reject \(H_0\) at the 0.05 significance level”

Duality of confidence intervals and hypothesis tests

  • Confidence intervals and hypothesis tests are closely related:
    • If a 95% confidence interval for \(\mu\) does not contain the value \(\mu_0\), then a two-sided test of \(H_0: \mu = \mu_0\) will have p-value < 0.05
    • If a 95% confidence interval does contain \(\mu_0\), then the p-value will be ≥ 0.05
  • Confidence intervals tell us about plausible parameter values
  • Hypothesis tests tell us whether specific parameter values are plausible

Cookbook for inference in R

Flowchart

flowchart TD
    A[What type of parameter?] --> B[Mean]
    A --> C[Proportion] 
    A --> D[Categorical Association]
    
    B --> E[How many groups?]
    E --> F[One sample<br/>t.test&#40;x, mu = μ₀&#41;]
    E --> G[Two samples<br/>t.test&#40;y ~ group&#41;<br/>or<br/>t.test&#40;x1, x2&#41;]
    
    C --> H[How many groups?]
    H --> I[One sample<br/>prop.test&#40;x, n, p = p₀&#41;]
    H --> J[Two samples<br/>prop.test&#40;c&#40;x1,x2&#41;, c&#40;n1,n2&#41;&#41;]
    
    D --> K[Test of independence<br/>chisq.test&#40;table&#40;var1, var2&#41;&#41;]
    
    style F fill:#e1f5fe
    style G fill:#e1f5fe
    style I fill:#f3e5f5
    style J fill:#f3e5f5
    style K fill:#e8f5e8

Examples

Single unknown mean

Example: Anorexia weight gain

  • 72 sampled anorexia patients receiving various therapies for their condition
  • For each individual, we observe their weight before and after therapy
  • This is a histogram of the weight changes
Load the dataset, derive the weight change, and plot
data(anorexia, package = "MASS")
weight_chg <- anorexia$Postwt - anorexia$Prewt
hist(weight_chg, xlab = "Difference in weight before and after therapy")

Parameter of interest

  • \(\mu\) = mean weight change among all anorexia patients
  • This is unknown, but:
    • We do know \(\bar x\), the sample mean among these 72 patients
summary(weight_chg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-12.200  -2.225   1.650   2.764   9.100  21.500 
  • Among all anorexia patients, is the mean change in weight larger than zero?

Hypothesis test: formalizing the question

  • \(H_0: \mu = 0\), versus
  • \(H_A: \mu > 0\)
  • at level \(\alpha = 0.05\)

Hypothesis test: result

test_result <- t.test(
  x = weight_chg, # numeric vector of observations
  mu = 0, # value from H0
  alternative = "greater" # what kind of inequality is in HA
)
test_result

    One Sample t-test

data:  weight_chg
t = 2.9376, df = 71, p-value = 0.002229
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 1.195825      Inf
sample estimates:
mean of x 
 2.763889 
  • Outcome: The p-value is 0.002, smaller than \(\alpha = 0.05\). We reject \(H_0\).

  • Interpretation: There is significant evidence that the mean weight gain exceeds zero.

Confidence interval

Provide a 95% confidence interval for \(\mu\)

# do not need "mu" or "alternative" arguments here
# we do need the "level" argument
ci_result <- t.test(x = weight_chg,
                    level = 0.95) 
class(ci_result)
[1] "htest"
typeof(ci_result)
[1] "list"
names(ci_result)
 [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
 [6] "null.value"  "stderr"      "alternative" "method"      "data.name"  
ci_result$conf.int
[1] 0.8878354 4.6399424
attr(,"conf.level")
[1] 0.95
  • Outcome: A 95% confidence interval for \(\mu\) is (0.89, 4.64)

  • Interpretation: We are 95% confident that the population mean weight change is between 0.89 and 4.64.

Comparing two means

Example: clinical trial

  • Objective: design an experiment to determine if a treatment is beneficial
  • Key attribute: randomization ensures that any differences between groups is attributable to treatment
    • Reduces or removes confounding

flowchart LR
    A[Eligible Patients<br/>with Epilepsy] --> B[Randomization]
    B --> C[Treatment Group<br/>Progabide]
    B --> D[Control Group<br/>Placebo]
    C --> E[Measure<br/>log Seizure Rate]
    D --> F[Measure<br/>log Seizure Rate]
    E --> G[Statistical Comparison<br/>t.test&#40;log.seizure.rate ~ treatment&#41;]
    F --> G
    
    style A fill:#f9f9f9
    style B fill:#fff2cc
    style C fill:#e1f5fe
    style D fill:#ffebee
    style G fill:#f3e5f5

Outcome data from the trial

Load and display the epilepsy clinical trial data
data("epilepsy", package = "HSAUR")
epilepsy <- subset(epilepsy, period == "4")
epilepsy$log.seizure.rate <- log(1 + epilepsy$seizure.rate)
epilepsy$treatment <- relevel(epilepsy$treatment, "Progabide")
str(epilepsy)
'data.frame':   59 obs. of  7 variables:
 $ treatment       : Factor w/ 2 levels "Progabide","placebo": 2 2 2 2 2 2 2 2 2 2 ...
 $ base            : int  11 11 6 8 66 27 12 52 23 10 ...
 $ age             : int  31 30 25 36 22 29 31 42 37 28 ...
 $ seizure.rate    : int  3 3 5 4 21 7 2 12 5 0 ...
 $ period          : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 4 4 4 4 4 4 4 4 4 4 ...
 $ subject         : Factor w/ 59 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ log.seizure.rate: num  1.39 1.39 1.79 1.61 3.09 ...

Formalizing the research question

  • The question concerns the mean of the log seizure rate
    • \(\mu_{\text{treatment}} = \text{mean among patients receiving the active treatment}\)
    • \(\mu_{\text{placebo}} = \text{mean among patients receiving placebo}\)
  • We have two competing hypotheses:
    • \(H_0: \mu_{\text{treatment}} = \mu_{\text{placebo}}\) (no difference)
    • \(H_A: \mu_{\text{treatment}} < \mu_{\text{placebo}}\) (lower seizure rate on treatment)
  • We will test them at \(\alpha = 0.05\)

Assessing the evidence

test_result <- t.test(
  log.seizure.rate ~ treatment,
  data = epilepsy,
  alternative = "less"
)
test_result

    Welch Two Sample t-test

data:  log.seizure.rate by treatment
t = -1.5002, df = 56.294, p-value = 0.06958
alternative hypothesis: true difference in means between group Progabide and group placebo is less than 0
95 percent confidence interval:
       -Inf 0.04003111
sample estimates:
mean in group Progabide   mean in group placebo 
               1.541449                1.890246 
  • Outcome: the p-value is 0.0696, which is greater than \(\alpha = 0.05\). We do not reject \(H_0\).

  • Interpretation: there is insufficient evidence to conclude that the mean log seizure rate among treated patients is lower than among patients receiving placebo.

Single proportion

Election poll

  • The October 20-23, 2024 NYT/Siena poll surveyed a random sample of 2516 likely voters nationwide.
  • One question asked was: if the 2024 presidential election were held today, who would you vote for?
Define and display the poll results
n <- 2516
poll_data <- tribble(
  ~candidate, ~count,
  "Kamala Harris", 1132,
  "Donald Trump", 1157,
  "All other candidates", 227
) 
poll_data$proportion <- scales::percent(poll_data$count / n, 1)

knitr::kable(poll_data)
candidate count proportion
Kamala Harris 1132 45%
Donald Trump 1157 46%
All other candidates 227 9%

A research question

  • What proportion of all likely voters at the time would vote for Kamala Harris?
  • Let’s construct a 95% confidence interval for this unknown proportion

Constructing a confidence interval

ci_result <- prop.test(x = 1132, n = 2516)$conf.int
ci_result
[1] 0.4303749 0.4696204
attr(,"conf.level")
[1] 0.95
  • We are 95% confident that between 43.0% and 47.0% of likely voters would vote for Kamala Harris

Testing a hypothesis

  • Among all likely voters, is the proportion of Kamala Harris voters less than 50%?
  • \(H_0: p = 0.5\)
  • \(H_A: p < 0.5\)
  • Test at \(\alpha = 0.05\)

Test result

prop.test(x = 1132, n = 2516, alternative = "less", p = 0.5)

    1-sample proportions test with continuity correction

data:  1132 out of 2516, null probability 0.5
X-squared = 25.04, df = 1, p-value = 2.807e-07
alternative hypothesis: true p is less than 0.5
95 percent confidence interval:
 0.0000000 0.4664785
sample estimates:
        p 
0.4499205 
  • The p-value is extremely small (less than \(\alpha = 0.05\))
  • There is convincing evidence that less than 50% of likely voters favor Kamala Harris

Comparing two proportions

Arthritis trial example

flowchart LR
    A[Eligible Patients<br/>with arthritis] --> B[Randomization]
    B --> C[Treatment]
    B --> D[Placebo]
    C --> E[Proportion of patients<br/>showing marked symptom improvement]
    D --> F[Proportion of patients<br/>showing marked symptom improvement]
    E --> G[Statistical Comparison<br/>prop.test]
    F --> G
    
    style A fill:#f9f9f9
    style B fill:#fff2cc
    style C fill:#e1f5fe
    style D fill:#ffebee
    style G fill:#f3e5f5

Trial outcome data

Load the patient-level data
if(!"vcd" %in% installed.packages()) install.packages("vcd")
data("Arthritis", package = "vcd")
str(Arthritis)
'data.frame':   84 obs. of  5 variables:
 $ ID       : int  57 46 77 17 36 23 75 39 33 55 ...
 $ Treatment: Factor w/ 2 levels "Placebo","Treated": 2 2 2 2 2 2 2 2 2 2 ...
 $ Sex      : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
 $ Age      : int  27 29 30 32 46 58 59 59 63 63 ...
 $ Improved : Ord.factor w/ 3 levels "None"<"Some"<..: 2 1 1 3 3 3 1 3 1 1 ...

Formalizing the research question

  • Is the probability of marked symptom improvement higher on treatment than placebo?
  • \(H_0: p_{\text{treatment}} = p_{\text{placebo}}\)
  • \(H_A: p_{\text{treatment}} > p_{\text{placebo}}\)
  • Test at \(\alpha = 0.05\)

Trial outcome data

counts <- with(
  Arthritis,
  table(Treatment, Improved == "Marked",
        dnn = list("treatment", "marked_improvement"))
)
counts
         marked_improvement
treatment FALSE TRUE
  Placebo    36    7
  Treated    20   21
prop.table(counts, margin = 1)
         marked_improvement
treatment     FALSE      TRUE
  Placebo 0.8372093 0.1627907
  Treated 0.4878049 0.5121951
class(counts)
[1] "table"

Testing the hypotheses

  • Here we pass the table object as the first argument to prop.test()
prop.test(
  counts[c("Treated", "Placebo"),
         c("TRUE", "FALSE")],
  alternative = "greater"
)

    2-sample test for equality of proportions with continuity correction

data:  counts[c("Treated", "Placebo"), c("TRUE", "FALSE")]
X-squared = 10.012, df = 1, p-value = 0.0007778
alternative hypothesis: greater
95 percent confidence interval:
 0.1672693 1.0000000
sample estimates:
   prop 1    prop 2 
0.5121951 0.1627907 
  • The p-value is small (less than \(\alpha = 0.05\)). We reject \(H_0\). There is strong evidence that the response rate is higher among treated patients than patients receiving placebo.

Producing a confidence interval

For the difference in two proportions

ci_result <- prop.test(counts)$conf.int
ci_result
[1] 0.1369409 0.5618679
attr(,"conf.level")
[1] 0.95
  • We are 95% confident that the difference in response rates (treatment versus placebo) is between 13.7% and 56.2%.

Associations with categorical data

Titanic example

Load and display the Titanic survival data
data("Titanic", package = "datasets")
str(Titanic)
 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
 - attr(*, "dimnames")=List of 4
  ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
  ..$ Sex     : chr [1:2] "Male" "Female"
  ..$ Age     : chr [1:2] "Child" "Adult"
  ..$ Survived: chr [1:2] "No" "Yes"

The research question

  • Does the probability of surviving the wreck depend on class?
  • Preparing the data (marginalizing over other factors)
class_table <- margin.table(Titanic, c("Class", "Survived"))
class_table
      Survived
Class   No Yes
  1st  122 203
  2nd  167 118
  3rd  528 178
  Crew 673 212

Formalizing

  • Is the distribution of survival independent of class?
  • \(H_0: \text{survival is independent of class}\)
  • \(H_A: \text{survival is NOT independent of class}\)
  • Test at \(\alpha = 0.05\)

Carrying out the test

chisq.test(class_table)

    Pearson's Chi-squared test

data:  class_table
X-squared = 190.4, df = 3, p-value < 2.2e-16
  • The p-value is very small. We reject \(H_0\). There is strong evidence that the survival is not independent from class

Returning to the tea-tasting example

Tea tasting experiment

  • The hypotheses
Hypothesis Name Description Interpretation
\(H_0\) Null Hypothesis Lady cannot distinguish tea types Success rate = 0.5 (random guessing)
\(H_A\) Alternative Hypothesis Lady can distinguish tea types Success rate > 0.5 (genuine ability)
  • The experimental data
True: Tea First True: Milk First
Guess: Tea First 3 1
Guess: Milk First 1 3

Preparing the data

outcome <- matrix(
  c(3, 1,
    1, 3),
  nrow = 2, ncol = 2,
  byrow = TRUE, dimnames = list(guess = c("tea first", "milk first"),
                                truth = c("tea first", "milk first"))
)
outcome
            truth
guess        tea first milk first
  tea first          3          1
  milk first         1          3

Assessing the strength of evidence

  • Fisher used Fisher’s exact test (which we will not cover in the class - still you should be able to interpret the resulting p-value)
fisher.test(outcome, alternative = "greater")

    Fisher's Exact Test for Count Data

data:  outcome
p-value = 0.2429
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.3135693       Inf
sample estimates:
odds ratio 
  6.408309 
  • The p-value is large (larger than \(\alpha = 0.05\)). We do not reject \(H_0\). There is insufficient evidence to conclude the lady can distinguish tea types better than random guessing.