# The Data {#sec-data}

```{r}
#| label: setup_01_data
#| echo: false
#| message: false
#| warning: false
#| include: false
here::i_am("quarto/01_data.qmd")
library(here)
library(tidyverse)
source(here("R", "01_data.R"))
```

This project uses the Student Performance dataset from two Portuguese secondary schools, hosted by the [UCI Machine Learning Repository](https://archive.ics.uci.edu/). The dataset has 33 attributes spanning demographics, family background, study habits, school support, lifestyle factors, absences, and three period grades. Two subject-specific files are provided: Mathematics (`student-mat.csv`, 395 students) and Portuguese (`student-por.csv`, 649 students). We analyze only the Mathematics dataset.

## Citation and source

- Original paper: Paulo Cortez and Alice M. G. Silva (2008), “Using data mining to predict secondary school student performance,” Proceedings of the 5th Annual Future Business Technology Conference. Link: https://www.semanticscholar.org/paper/61d468d5254730bbecf822c6b60d7d6595d9889c
- Repository citation: Cortez, P. (2008). Student Performance [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5TG7T
- Dataset page: https://archive.ics.uci.edu/dataset/320/student+performance
- License: Creative Commons Attribution 4.0 (CC BY 4.0)

## Raw data files

- `data/raw/student-mat.csv`: Mathematics subject records

## Build derived dataset

The `read_student_mat()` and `coerce_types()` functions are defined in `R/01_data.R`. They were written for this project and are tailored to this dataset: `read_student_mat()` reads the raw CSV file, and `coerce_types()` coerces the columns to appropriate types.

```{r}
#| message: false
raw_data <- read_student_mat(here("data", "raw", "student-mat.csv"))
analysis_data <- coerce_types(raw_data)
saveRDS(analysis_data, here("data", "derived", "student_performance.rda"))
str(analysis_data)
```

The derived "clean data" object is saved to `data/derived/student_performance.rda` for use in later sections.

## Variables

As described at the links above:

- `school`: Student’s school (`GP` = Gabriel Pereira, `MS` = Mousinho da Silveira)
- `sex`: Sex (`F`, `M`)
- `age`: Age (15–22)
- `address`: Home address type (`U` = urban, `R` = rural)
- `famsize`: Family size (`LE3` ≤ 3, `GT3` > 3)
- `Pstatus`: Parents’ cohabitation status (`T` = together, `A` = apart)
- `Medu`: Mother’s education (0 none; 1 primary (4th); 2 5th–9th; 3 secondary; 4 higher)
- `Fedu`: Father’s education (same scale as `Medu`)
- `Mjob`: Mother’s job (`teacher`, `health`, `services`, `at_home`, `other`)
- `Fjob`: Father’s job (same categories as `Mjob`)
- `reason`: Reason for choosing the school (`home`, `reputation`, `course`, `other`)
- `guardian`: Student’s guardian (`mother`, `father`, `other`)
- `traveltime`: Home→school travel time (1 <15m; 2 15–30m; 3 30–60m; 4 >1h)
- `studytime`: Weekly study time (1 <2h; 2 2–5h; 3 5–10h; 4 >10h)
- `failures`: Past class failures (n if 1≤n<3, else 4)
- `schoolsup`: Extra educational support (yes/no)
- `famsup`: Family educational support (yes/no)
- `paid`: Extra paid classes within subject (yes/no)
- `activities`: Extracurricular activities (yes/no)
- `nursery`: Attended nursery school (yes/no)
- `higher`: Intends to pursue higher education (yes/no)
- `internet`: Internet access at home (yes/no)
- `romantic`: In a romantic relationship (yes/no)
- `famrel`: Quality of family relationships (1 very bad – 5 excellent)
- `freetime`: Free time after school (1 very low – 5 very high)
- `goout`: Going out with friends (1 very low – 5 very high)
- `Dalc`: Workday alcohol consumption (1 very low – 5 very high)
- `Walc`: Weekend alcohol consumption (1 very low – 5 very high)
- `health`: Current health status (1 very bad – 5 very good)
- `absences`: Number of school absences (0–93)
- `G1`: First-period grade (0–20)
- `G2`: Second-period grade (0–20)
- `G3`: Final grade (target, 0–20)
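
The coding scheme in the list above maps fairly directly onto R types. The following is a minimal sketch of what a type-coercion step of this kind could look like. It is not the project's `coerce_types()` (that lives in `R/01_data.R`): the function name `coerce_types_sketch()` and the toy data frame are hypothetical, and the choice to store the 1–5 and education scales as ordered factors rather than integers is an assumption made here for illustration.

```r
library(dplyr)

# Hypothetical helper for illustration only; the project's real
# coerce_types() is defined in R/01_data.R and may differ.
coerce_types_sketch <- function(df) {
  yes_no  <- c("schoolsup", "famsup", "paid", "activities",
               "nursery", "higher", "internet", "romantic")
  nominal <- c("school", "sex", "address", "famsize", "Pstatus",
               "Mjob", "Fjob", "reason", "guardian")
  scale_1_to_5 <- c("famrel", "freetime", "goout", "Dalc", "Walc", "health")

  df |>
    mutate(
      # yes/no strings become logical
      across(any_of(yes_no), \(x) x == "yes"),
      # unordered categories become plain factors
      across(any_of(nominal), as.factor),
      # coded scales become ordered factors (an assumption, see lead-in)
      across(any_of(c("Medu", "Fedu")),
             \(x) factor(x, levels = 0:4, ordered = TRUE)),
      across(any_of(c("traveltime", "studytime")),
             \(x) factor(x, levels = 1:4, ordered = TRUE)),
      across(any_of(scale_1_to_5),
             \(x) factor(x, levels = 1:5, ordered = TRUE)),
      # counts and grades become integers
      across(any_of(c("age", "failures", "absences", "G1", "G2", "G3")),
             as.integer)
    )
}

# Toy rows (made up, not taken from student-mat.csv) to show the effect
toy <- data.frame(
  school = c("GP", "MS"),
  sex = c("F", "M"),
  age = c(17, 18),
  Medu = c(4, 1),
  studytime = c(2, 3),
  higher = c("yes", "no"),
  Dalc = c(1, 5),
  G3 = c(11, 6)
)

toy_typed <- coerce_types_sketch(toy)
str(toy_typed)
```

Using `any_of()` lets the sketch run on a subset of columns without error. Whether ordinal scales are better kept as ordered factors or as integers depends on the models used in later sections.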