2  The Data

This project uses the Student Performance dataset, collected at two Portuguese secondary schools and hosted by the UCI Machine Learning Repository. Each record has 33 attributes spanning demographics, family background, study habits, school support, lifestyle factors, absences, and three period grades. Two subject-specific files are provided: Mathematics (student-mat.csv, 395 students) and Portuguese (student-por.csv, 649 students). We analyze only the Mathematics file.

2.1 Citation and source

  • Original paper: Paulo Cortez and Alice M. G. Silva (2008), “Using data mining to predict secondary school student performance,” Proceedings of the 5th Annual Future Business Technology Conference. Link: https://www.semanticscholar.org/paper/61d468d5254730bbecf822c6b60d7d6595d9889c
  • Repository citation: Cortez, P. (2008). Student Performance [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5TG7T
  • Dataset page: https://archive.ics.uci.edu/dataset/320/student+performance
  • License: Creative Commons Attribution 4.0 (CC BY 4.0)

2.2 Raw data files

  • data/raw/student-mat.csv — Mathematics subject records

2.3 Build derived dataset

The read_student_mat() and coerce_types() functions are defined in R/01_data.R. Both were written for this project and are specific to this dataset: read_student_mat() reads the raw CSV file, and coerce_types() converts each column to an appropriate type (integer, factor, or ordered factor).

raw_data <- read_student_mat(here("data", "raw", "student-mat.csv"))
analysis_data <- coerce_types(raw_data)
saveRDS(analysis_data, here("data", "derived", "student_performance.rda"))
str(analysis_data)
tibble [395 × 33] (S3: tbl_df/tbl/data.frame)
 $ school    : Factor w/ 2 levels "Gabriel Pereira",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sex       : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2 1 2 2 ...
 $ age       : int [1:395] 18 17 15 15 16 16 16 17 15 15 ...
 $ address   : Factor w/ 2 levels "Urban","Rural": 1 1 1 1 1 1 1 1 1 1 ...
 $ famsize   : Factor w/ 2 levels "≤3"," >3": 2 2 1 2 2 1 1 2 1 2 ...
 $ Pstatus   : Factor w/ 2 levels "together","apart": 2 1 1 1 1 1 1 2 2 1 ...
 $ Medu      : Ord.factor w/ 5 levels "none"<"primary(4th)"<..: 5 2 2 5 4 5 3 5 4 4 ...
 $ Fedu      : Ord.factor w/ 5 levels "none"<"primary(4th)"<..: 5 2 2 3 4 4 3 5 3 5 ...
 $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
 $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
 $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
 $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
 $ traveltime: Ord.factor w/ 4 levels "<15m"<"15-30m"<..: 2 1 1 1 1 1 1 2 1 1 ...
 $ studytime : Ord.factor w/ 4 levels "<2h"<"2-5h"<"5-10h"<..: 2 2 2 3 2 2 2 2 2 2 ...
 $ failures  : int [1:395] 0 0 3 0 0 0 0 0 0 0 ...
 $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
 $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
 $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
 $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
 $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
 $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
 $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
 $ famrel    : Ord.factor w/ 5 levels "very bad"<"bad"<..: 4 5 4 3 4 5 4 4 4 5 ...
 $ freetime  : Ord.factor w/ 5 levels "very low"<"low"<..: 3 3 3 2 3 4 4 1 2 5 ...
 $ goout     : Ord.factor w/ 5 levels "very low"<"low"<..: 4 3 2 2 2 2 4 4 2 1 ...
 $ Dalc      : Ord.factor w/ 5 levels "very low"<"low"<..: 1 1 2 1 1 1 1 1 1 1 ...
 $ Walc      : Ord.factor w/ 5 levels "very low"<"low"<..: 1 1 3 1 2 2 1 1 1 1 ...
 $ health    : Ord.factor w/ 5 levels "very bad"<"bad"<..: 3 3 3 5 5 5 3 1 1 5 ...
 $ absences  : int [1:395] 6 4 10 2 4 10 0 6 0 0 ...
 $ G1        : int [1:395] 5 5 7 15 6 15 12 6 16 14 ...
 $ G2        : int [1:395] 6 5 8 14 10 15 12 5 18 15 ...
 $ G3        : int [1:395] 6 6 10 15 10 15 11 6 19 15 ...
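
For readers without R/01_data.R at hand, the coercion step might look roughly like the sketch below. It is not the project's code: coerce_types_sketch() and the three-row raw_text sample are hypothetical, only a handful of columns are handled, and the level labels that str() truncates above ("5th-9th", "secondary", "higher", ">10h") are guesses. The sample is semicolon-delimited, as the UCI distribution of student-mat.csv is.

```r
# Hypothetical sketch; the project's real functions live in R/01_data.R.
# A three-row stand-in for the semicolon-delimited raw file.
raw_text <- "school;sex;age;Medu;studytime;schoolsup;G3
GP;F;18;4;2;yes;6
GP;M;16;1;3;no;15
MS;F;17;0;1;no;10"

raw_sketch <- read.csv(text = raw_text, sep = ";", stringsAsFactors = FALSE)

coerce_types_sketch <- function(d) {
  # Nominal variables: plain factors with readable labels.
  d$school <- factor(d$school, levels = c("GP", "MS"),
                     labels = c("Gabriel Pereira", "Mousinho da Silveira"))
  d$sex <- factor(d$sex, levels = c("F", "M"), labels = c("Female", "Male"))
  d$schoolsup <- factor(d$schoolsup, levels = c("no", "yes"))

  # Ordinal variables: ordered factors, so comparisons such as `<` are meaningful.
  # Labels past the second level are assumed, not taken from the project.
  d$Medu <- factor(d$Medu, levels = 0:4,
                   labels = c("none", "primary(4th)", "5th-9th",
                              "secondary", "higher"),
                   ordered = TRUE)
  d$studytime <- factor(d$studytime, levels = 1:4,
                        labels = c("<2h", "2-5h", "5-10h", ">10h"),
                        ordered = TRUE)
  d
}

analysis_sketch <- coerce_types_sketch(raw_sketch)
str(analysis_sketch)
```

Counts and grades (age, G3) are left as integers, matching the str() output above.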

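The derived “clean data” object is saved to data/derived/student_performance.rda for use in later sections. Because it is written with saveRDS(), it must be read back with readRDS() rather than load(), despite the .rda extension.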
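
A later section would presumably reload the object with readRDS(). The round trip can be sketched as follows, using a temporary file and a two-row stand-in (demo_data, a hypothetical name) so that nothing under data/derived/ is touched.

```r
# Stand-in for analysis_data; the real object has 395 rows and 33 columns.
demo_data <- data.frame(
  studytime = factor(c("<2h", "2-5h"), levels = c("<2h", "2-5h"), ordered = TRUE),
  G3 = c(6L, 15L)
)

# A temporary path keeps the sketch from writing into the project tree.
path <- tempfile(fileext = ".rda")
saveRDS(demo_data, path)

# readRDS() returns the object itself, so it is assigned to a name of our choosing.
restored <- readRDS(path)

# Factor levels and their ordering survive the round trip.
identical(demo_data, restored)
is.ordered(restored$studytime)
```

The point of saving the coerced object rather than re-reading the CSV is that the factor levels and their ordering are stored with it, so later sections do not need to repeat the coercion.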

2.4 Variables

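The raw file codes the variables as follows, as described at the links in Section 2.1. The derived object replaces several of these codes with the labels shown in the str() output above.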

  • school: Student’s school (GP = Gabriel Pereira, MS = Mousinho da Silveira)
  • sex: Sex (F, M)
  • age: Age (15–22)
  • address: Home address type (U = urban, R = rural)
  • famsize: Family size (LE3 = 3 or fewer, GT3 = more than 3)
  • Pstatus: Parents’ cohabitation status (T = together, A = apart)
  • Medu: Mother’s education (0 = none; 1 = primary (4th grade); 2 = 5th–9th grade; 3 = secondary; 4 = higher)
  • Fedu: Father’s education (same scale as Medu)
  • Mjob: Mother’s job (teacher, health, services, at_home, other)
  • Fjob: Father’s job (same categories as Mjob)
  • reason: Reason for choosing the school (home, reputation, course, other)
  • guardian: Student’s guardian (mother, father, other)
  • traveltime: Home-to-school travel time (1 = <15 min; 2 = 15–30 min; 3 = 30–60 min; 4 = >1 h)
  • studytime: Weekly study time (1 = <2 h; 2 = 2–5 h; 3 = 5–10 h; 4 = >10 h)
  • failures: Number of past class failures (documented as n if 1 ≤ n < 3, else 4; note that the str() output above includes the values 0 and 3)
  • schoolsup: Extra educational support (yes/no)
  • famsup: Family educational support (yes/no)
  • paid: Extra paid classes within subject (yes/no)
  • activities: Extracurricular activities (yes/no)
  • nursery: Attended nursery school (yes/no)
  • higher: Intends to pursue higher education (yes/no)
  • internet: Internet access at home (yes/no)
  • romantic: In a romantic relationship (yes/no)
  • famrel: Quality of family relationships (1 = very bad to 5 = excellent)
  • freetime: Free time after school (1 = very low to 5 = very high)
  • goout: Going out with friends (1 = very low to 5 = very high)
  • Dalc: Workday alcohol consumption (1 = very low to 5 = very high)
  • Walc: Weekend alcohol consumption (1 = very low to 5 = very high)
  • health: Current health status (1 = very bad to 5 = very good)
  • absences: Number of school absences (0–93)
  • G1: First-period grade (0–20)
  • G2: Second-period grade (0–20)
  • G3: Final grade (target, 0–20)
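
The documented ranges in the list above lend themselves to a quick sanity check on the integer columns. The sketch below is one possible way to do it: check_ranges() is a hypothetical helper, not part of R/01_data.R, and grades_demo holds only the first three rows of the relevant columns as printed by str() above.

```r
# Hypothetical helper: for each integer column, report whether every value
# lies inside the range documented in the variable list.
check_ranges <- function(d) {
  ranges <- list(
    age      = c(15, 22),
    absences = c(0, 93),
    G1       = c(0, 20),
    G2       = c(0, 20),
    G3       = c(0, 20)
  )
  vapply(names(ranges), function(v) {
    all(d[[v]] >= ranges[[v]][1] & d[[v]] <= ranges[[v]][2])
  }, logical(1))
}

# First three students, copied from the str() output.
grades_demo <- data.frame(
  age      = c(18L, 17L, 15L),
  absences = c(6L, 4L, 10L),
  G1       = c(5L, 5L, 7L),
  G2       = c(6L, 5L, 8L),
  G3       = c(6L, 6L, 10L)
)

# Named logical vector, one entry per checked column.
check_ranges(grades_demo)
```

Applied to the full analysis_data, any FALSE entry would point to a column whose values disagree with the documentation and deserve a closer look before modeling.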