Week 3: Descriptive Statistics and Summarizing Data

Refresher I

Class and type

  • We can use class() and typeof(), respectively, to determine the class and basic type of an object
data(diamonds, package = "ggplot2") # load an example dataset
class(diamonds) # inherits from data.frame, among other classes
[1] "tbl_df"     "tbl"        "data.frame"
typeof(diamonds) # basic type is a list
[1] "list"
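The distinction matters because class() determines how functions treat an object, while typeof() reports how it is stored. A minimal sketch with small made-up objects (the names f, n, and df are illustrative and not part of the slides):

```r
# A factor has class "factor" but is stored as an integer vector
f <- factor(c("low", "high", "low"))
class(f)   # "factor"
typeof(f)  # "integer"

# A plain numeric vector has class "numeric" and is stored as double
n <- c(1.5, 2, 3)
class(n)   # "numeric"
typeof(n)  # "double"

# inherits() checks for one class without comparing the whole class vector
df <- data.frame(a = 1:3)
inherits(df, "data.frame")  # TRUE
typeof(df)                  # "list"
```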

Refresher II

Basic structure

  • The str() function produces a concise summary of the structure of the object
str(diamonds)
Classes 'tbl_df', 'tbl' and 'data.frame':   53940 obs. of  10 variables:
 $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Refresher III

Subsetting a column from a data frame

  • We can use $ or [[ to subset columns from a data frame
str(diamonds$price) # structure of the "price" variable
 int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
identical(diamonds$price, diamonds[["price"]])
[1] TRUE
  • We could define a new variable to take the value of the column
price <- diamonds$price
identical(price, diamonds$price)
[1] TRUE
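For contrast, single brackets keep the data frame structure instead of extracting the column itself. A small sketch on a toy data frame (toy and its values are hypothetical, chosen only for illustration):

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(price = c(326L, 327L, 334L), carat = c(0.23, 0.21, 0.29))

col_vec <- toy[["price"]]  # [[ extracts the column itself (an integer vector)
col_df  <- toy["price"]    # [ returns a data frame with one column

class(col_vec)  # "integer"
class(col_df)   # "data.frame"

# $ and [[ give the same result here
identical(toy$price, col_vec)  # TRUE
```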

Refresher IV

Type coercion

  • If we wanted, we could coerce this integer object to a different type
price_num <- as.numeric(price)
str(price_num)
 num [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
price_chr <- as.character(price)
str(price_chr)
 chr [1:53940] "326" "326" "327" "334" "335" "336" "336" "337" "337" "338" ...
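Coercion does not always do what one might expect. The sketch below, using made-up values, shows three common surprises: factors coerce to their level codes, unparseable strings become NA, and as.integer() truncates rather than rounds.

```r
# Coercing a factor with as.numeric() returns the internal level codes,
# not the labels; go through as.character() to recover the labels as numbers
f <- factor(c("10", "20", "10"))
as.numeric(f)                # 1 2 1 (level codes)
as.numeric(as.character(f))  # 10 20 10

# Values that cannot be parsed become NA (R also issues a warning,
# suppressed here to keep the output clean)
bad <- suppressWarnings(as.integer(c("326", "n/a")))
bad  # 326 NA

# as.integer() truncates toward zero rather than rounding
as.integer(c(2.9, -2.9))  # 2 -2
```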

Refresher V

Aside about memory efficiency

  • Coercing to numeric or character offers no advantage here and has a cost: an integer vector uses 4 bytes per element, a numeric (double) vector uses 8, and a character vector uses more still
object.size(price)
215808 bytes
object.size(price_num)
431568 bytes
object.size(price_chr)
1081280 bytes

What does the dataset look like?

Printout of the diamonds data
library(DT) # for the tabular display
DT::datatable(diamonds, options = list(pageLength = 5))

What are some questions we might pose about these data?

Interlude: statistical terminology refresher

What is a population?

  • A large group of individuals or units
    • May be difficult to measure everyone
    • Still would like to draw conclusions about the large group, perhaps based on partial data

What is a distribution?

  • The distribution of a measurement refers to the relative frequencies or prevalence of different possible values across individuals in the population

What is a sample?

  • A set of measured individuals or units
  • “Sample” may also refer to the measurements themselves

What is a statistic?

  • Any function of the sample
  • For example, a numerical summary of the sample
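The terms above can be made concrete with a small simulation. This is only a sketch: the population is simulated from an exponential distribution, and the seed, population size, and sample size are arbitrary assumptions rather than anything from the course data.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: 100,000 simulated measurements
population <- rexp(100000, rate = 1 / 50)

# A sample: 200 units drawn at random without replacement
smp <- sample(population, size = 200)

# Statistics: functions of the sample
sample_mean   <- mean(smp)
sample_median <- median(smp)

# The corresponding population quantity, usually unknown in practice
pop_mean <- mean(population)

c(sample_mean = sample_mean, pop_mean = pop_mean)
```

The sample mean will generally be close to, but not equal to, the population mean; a different sample would give a different value of the statistic.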

Descriptive statistics

What are Descriptive Statistics?

  • Statistics describing the main features of a dataset
  • Help us understand the distribution, center, and spread of data.

Types of Descriptive Statistics

  • Measures of Center: Mean, Median, Mode
  • Measures of Spread: Range, Variance, Standard Deviation, IQR
  • Shape: Skewness, Kurtosis
  • Frequency Tables: Counts and proportions for categorical variables
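Base R has no skewness or kurtosis function, so the shape measures listed above need either a package or a few lines of code. The sketch below uses the simple moment-based definitions, which are one convention among several; the function names skewness and kurtosis are defined here and are not base R functions.

```r
# Moment-based sample skewness and kurtosis (one common convention)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2  # close to 3 for data from a normal distribution
}

sym   <- c(1, 2, 3, 4, 5)   # symmetric around 3: skewness is 0
right <- c(1, 1, 2, 2, 10)  # long right tail: skewness is positive

skewness(sym)
skewness(right)
kurtosis(sym)
```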

Summarizing Numeric Data

Measures of Center

  • Mean: Arithmetic average
  • Median: Middle value
  • Mode: Most frequent value
x <- c(1, 2, 2, 3, 4)
mean(x)
[1] 2.4
median(x)
[1] 2
# R has no built-in statistical mode (mode() returns the storage mode); use table(x)
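One way to compute the mode is to count each distinct value and keep the most frequent. The helper below is a hypothetical sketch (stat_mode is not a base R function); it returns every tied value, since a sample can have more than one mode.

```r
# Hypothetical helper: returns the most frequent value(s) of a vector
stat_mode <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))  # how often each distinct value occurs
  ux[counts == max(counts)]
}

x <- c(1, 2, 2, 3, 4)
stat_mode(x)                      # 2
stat_mode(c("a", "b", "b", "a"))  # "a" "b" (two modes)
```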

Measures of Spread

  • Range: Difference between max and min (range() returns the min and max themselves)
  • Variance: Average squared deviation from the mean (var() divides by n - 1, not n)
  • Standard Deviation: Square root of variance
  • IQR: Interquartile range (Q3 - Q1)
range(x)
[1] 1 4
var(x)
[1] 1.3
sd(x)
[1] 1.140175
IQR(x)
[1] 1
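To connect the definitions to the function outputs, the same quantities can be computed by hand. A short sketch on the same vector x (manual_var and q are names introduced here for illustration):

```r
x <- c(1, 2, 2, 3, 4)

# range() returns the minimum and maximum; diff() turns that into a width
diff(range(x))  # 3

# Sample variance by hand: sum of squared deviations divided by n - 1
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
all.equal(manual_var, var(x))  # TRUE

# IQR() is the distance between the 75th and 25th percentiles
q <- quantile(x, probs = c(0.25, 0.75))
unname(q[2] - q[1])  # 1
```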

Summarizing Categorical Data

Frequency Tables

  • Use table() to count occurrences
  • Use prop.table() for proportions
species <- c("Adelie", "Chinstrap", "Adelie", "Gentoo")
table(species)
species
   Adelie Chinstrap    Gentoo 
        2         1         1 
prop.table(table(species))
species
   Adelie Chinstrap    Gentoo 
     0.50      0.25      0.25 
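Frequency tables are ordinary R objects, so they can be sorted, queried, and rescaled. A brief sketch reusing the species vector (counts is a name introduced here):

```r
species <- c("Adelie", "Chinstrap", "Adelie", "Gentoo")
counts <- table(species)

# Order categories from most to least frequent
sort(counts, decreasing = TRUE)

# Name of the most frequent category (the mode);
# which.max() returns only the first one if there are ties
names(which.max(counts))  # "Adelie"

# Percentages rounded to one decimal place
round(100 * prop.table(counts), 1)
```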

Back to the Diamonds Dataset

Recalling the structure

str(diamonds)
Classes 'tbl_df', 'tbl' and 'data.frame':   53940 obs. of  10 variables:
 $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Diamond price

Measures of center

  • What is the average price (sample mean price)?
mean(diamonds$price)
[1] 3932.8
  • What is a representative price? (Median)
median(diamonds$price)
[1] 2401

Diamond price

Measures of spread

  • Range of values
range(diamonds$price)
[1]   326 18823
  • Standard deviation and variance
sd(diamonds$price)
[1] 3989.44
var(diamonds$price)
[1] 15915629
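# holds exactly because sd() is computed as sqrt(var()); prefer all.equal() for floating-point comparisons in general
sd(diamonds$price) == sqrt(var(diamonds$price))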
[1] TRUE

Diamond price

Measures of spread

  • Quartiles: the summary() function returns the quartiles, among other information
summary(diamonds$price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    326     950    2401    3933    5324   18823 
  • Interquartile range
IQR(diamonds$price)
[1] 4374.25

Diamond price

Shape

  • A histogram is a visual display of the distribution of a numeric variable
hist(diamonds$price)
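A histogram is built by counting observations in bins, and the choice of bins affects the apparent shape. As a rough sketch on a small made-up vector, hist() can return the bin counts without drawing anything:

```r
x <- c(1, 2, 2, 3, 4)

# plot = FALSE returns the bin information instead of drawing the plot
h <- hist(x, breaks = 0:4, plot = FALSE)
h$breaks  # 0 1 2 3 4
h$counts  # 1 2 1 1 (bins are closed on the right by default)
```

Passing a different breaks argument to hist() changes how finely the data are binned.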

Diamond price

Shape

  • Based on the histogram and the earlier descriptive statistics, describe the shape of the distribution of diamond prices
    • Unimodal
    • Right-skewed
    • Mean price is 3932.8 USD
    • Standard deviation is 3989.44 USD
    • Middle 50% of diamonds range in price from 950 USD to 5324 USD (see summary(diamonds$price))

Diamond cut quality

The cut variable

  • This is an example of an ordinal categorical variable
str(diamonds$cut) # stored in R as an ordered factor
 Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
  • The possible levels of the cut variable are:
levels(diamonds$cut)
[1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"    
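An ordered factor like this can be built from character data by listing the levels from lowest to highest. The sketch below uses a hypothetical five-element vector (cut_chr) rather than the diamonds data:

```r
# Hypothetical toy data: cut grades stored as character strings
cut_chr <- c("Ideal", "Fair", "Good", "Ideal", "Premium")

# Listing the levels in order with ordered = TRUE creates an ordinal factor
cut_ord <- factor(
  cut_chr,
  levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
  ordered = TRUE
)

is.ordered(cut_ord)   # TRUE
cut_ord >= "Premium"  # TRUE FALSE FALSE TRUE TRUE
max(cut_ord)          # Ideal
table(cut_ord)        # the unused level "Very Good" appears with count 0
```

Because the levels are ordered, comparisons and max() are meaningful, which would not be the case for an unordered factor.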

Diamond cut quality

Frequency table

  • The frequency table below tells us how common each cut quality is within the sample
table(diamonds$cut)

     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551 
  • Q: What is the most common cut (the mode)? Which cut appears least often in the dataset?

Diamond cut quality

Proportion table (relative frequency)

  • Rather than the total counts per category, we may be interested in the proportion of diamonds in each category
  • For such a relative frequency table, we can do:
prop.table(table(diamonds$cut))

      Fair       Good  Very Good    Premium      Ideal 
0.02984798 0.09095291 0.22398962 0.25567297 0.39953652 
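For an ordinal variable, cumulative proportions are often a useful companion to the relative frequency table. A small sketch on a hypothetical eight-element sample (grade and props are names introduced here):

```r
# Hypothetical toy sample of an ordinal variable
grade <- factor(
  c("Good", "Fair", "Ideal", "Good", "Ideal", "Ideal", "Premium", "Ideal"),
  levels = c("Fair", "Good", "Premium", "Ideal"),
  ordered = TRUE
)

props <- prop.table(table(grade))
props          # Fair 0.125, Good 0.250, Premium 0.125, Ideal 0.500
cumsum(props)  # share of observations at or below each level
```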

Outliers and outlier-resistant summaries

Boston housing data

Printout of the Boston housing data
data("Boston", package = "MASS")
DT::datatable(Boston, options = list(pageLength = 5))

variable  description
crim      per capita crime rate by town
zn        proportion of residential land zoned for lots over 25,000 sq.ft.
indus     proportion of non-retail business acres per town
chas      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox       nitric oxides concentration (parts per 10 million)
rm        average number of rooms per dwelling
age       proportion of owner-occupied units built prior to 1940
dis       weighted distances to five Boston employment centres
rad       index of accessibility to radial highways
tax       full-value property-tax rate per $10,000
ptratio   pupil-teacher ratio by town
lstat     lower status of the population (percent)
medv      median value of owner-occupied homes in $1000s

Per capita crime rate by town

Histogram

hist(Boston$crim, xlab = "Per capita crime rate by town") # label taken from the variable description

Per capita crime rate by town

Boxplot

boxplot(Boston$crim, ylab = "Per capita crime rate by town") # label taken from the variable description
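The individual points a boxplot draws beyond its whiskers come from a simple rule: by default, values more than 1.5 times the interquartile range outside the quartiles are flagged. A sketch of that rule on a made-up vector (x and the fence names are illustrative):

```r
x <- c(1, 2, 3, 4, 5, 100)  # hypothetical data with one extreme value

q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
fence_low  <- q1 - 1.5 * (q3 - q1)
fence_high <- q3 + 1.5 * (q3 - q1)

# Values outside the fences are flagged as potential outliers
x[x < fence_low | x > fence_high]  # 100

# boxplot.stats() applies the same idea (using hinges rather than quantile())
boxplot.stats(x)$out  # 100
```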

Per capita crime rate by town

  • This variable has a strongly right-skewed distribution
  • The mean is pulled upward by a small number of towns with very high crime rates, so it is far larger than the median and does not describe a typical town
mean(Boston$crim)
[1] 3.613524
median(Boston$crim)
[1] 0.25651

Robust or outlier-resistant statistics

Measure              Type    Sensitive to outliers?  Notes
Mean                 Center  Yes                     Strongly affected by extreme values
Median               Center  No                      Robust; not affected by outliers
Mode                 Center  No                      Not affected by outliers
Range                Spread  Yes                     Determined by min and max values
Variance             Spread  Yes                     Squared deviations amplify outliers
Standard Deviation   Spread  Yes                     Derived from variance; sensitive
Interquartile Range  Spread  No                      Only uses middle 50% of data
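The pattern in the table can be checked directly by replacing one value with an outlier and recomputing each summary. This is a minimal sketch on made-up vectors; summarise_vec is a hypothetical helper defined here.

```r
clean    <- c(1, 2, 3, 4, 5)
with_out <- c(1, 2, 3, 4, 500)  # same data, largest value replaced by an outlier

# Hypothetical helper collecting four summaries in one named vector
summarise_vec <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), IQR = IQR(v))
}

rbind(clean = summarise_vec(clean), with_outlier = summarise_vec(with_out))
# The median and IQR are identical in both rows; the mean and sd are not

# A trimmed mean drops a fraction of values from each end before averaging
mean(with_out, trim = 0.2)  # 3
```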

Summary

Describing distributions

  • Descriptive statistics help us understand the center, spread, and shape of data
  • Use mean, median, and mode for measures of center; range, standard deviation, and IQR for spread
  • Frequency and proportion tables summarize categorical variables
  • Visualizations (histograms, boxplots, barplots) reveal distribution and outliers
  • Always consider the impact of outliers and choose robust statistics when appropriate