Data Wrangling Practice with R
R-Ladies Rome Tutorials
Introduction
Data preparation is a foundational step in the data science process. It refers to the set of procedures used to clean, organize, and structure data, ensuring it is ready for analysis. Without proper data preparation, even the most sophisticated analytical models can produce unreliable results due to inconsistencies or noise in the data. In practice, data preparation encompasses data wrangling, manipulation, and transformation techniques, which are essential for obtaining high-quality, actionable insights.
Essential Data Preparation Techniques: Wrangling, Manipulation, and Transformation
Data wrangling (also known as data munging) refers to the process of cleaning and organizing raw data to make it suitable for analysis. It also includes data manipulation and transformation techniques that help refine and structure the data for further analysis. These techniques are essential for ensuring data quality and consistency, enabling data scientists and analysts to derive meaningful insights from the data.
Data wrangling involves cleaning raw data (handling missing values, correcting errors, and merging datasets) to ensure accuracy and consistency. It also includes structuring data into a tidy format, where each variable is a column, each observation is a row, and each type of observational unit is a table. This format makes it easier to analyze and visualize data effectively.
- Handling Missing Data: Missing values (NA) are common in datasets. R provides several ways to address them, such as using is.na() to detect them, and na.omit() to remove them, or applying imputation methods to fill in missing values.
- Dealing with Duplicates: Identifying and removing duplicate rows is essential to avoid skewed results. The distinct() function in R’s dplyr package makes this task simple.
- Standardizing Data Formats: Ensuring consistent data types is crucial, especially for numeric, character, and factor variables. Functions like mutate() and as.numeric(), as.character(), or as.factor() help transform data into the appropriate format.
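As a minimal sketch of these three steps, here they are applied to a small invented data frame (all names and values are made up for illustration):

```r
library(dplyr)
library(tidyr)

# Toy data frame with a missing value, a duplicated row, and a
# numeric column stored as character (values are invented)
df <- data.frame(id    = c(1, 2, 2, 3),
                 score = c("10", "15", "15", NA))

sum(is.na(df$score))   # detect missing values

df %>%
  distinct() %>%                        # drop the duplicated row
  mutate(score = as.numeric(score)) %>% # standardize the data type
  drop_na(score)                        # remove rows with missing score
```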
Data manipulation builds on this by adjusting, filtering, and transforming specific aspects of the data, such as selecting relevant columns, filtering rows based on conditions, and creating new variables.
- Filtering Data: Filtering allows you to extract specific rows based on conditions. The filter() function in dplyr is useful for this task.
- Sorting Data: Sorting data helps organize it in a meaningful way. The arrange() function in dplyr can sort data by one or more columns.
- Grouping Data: Grouping data allows you to perform calculations on subsets of the data. The group_by() function in dplyr is used to group data by one or more variables.
- Summarizing Data: Summarizing data provides insights into the dataset’s characteristics. The summarize() function in dplyr can calculate summary statistics like mean, median, and standard deviation.
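Before applying these verbs to the cardio data, here is a quick sketch of how they chain together on R's built-in mtcars dataset:

```r
library(dplyr)

mtcars %>%
  filter(mpg > 20) %>%            # keep only fuel-efficient cars
  arrange(desc(mpg)) %>%          # sort by mileage, highest first
  group_by(cyl) %>%               # group by number of cylinders
  summarize(mean_mpg = mean(mpg), # one summary row per group
            n = n())
```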
Data transformation goes deeper by reshaping the structure of the data (e.g., converting between wide and long formats), normalizing values, or applying mathematical operations like log transformations to ensure that the data is suitable for complex analysis or modeling.
- Reshaping Data: Reshaping data involves converting between wide and long formats using functions like gather() and spread() in tidyr.
- Normalizing Data: Normalizing data involves scaling values to a standard range, such as between 0 and 1, to ensure consistency across variables.
- Applying Mathematical Operations: Mathematical operations like log transformations, standardization, or normalization can help prepare data for advanced analyses like machine learning.
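As an illustration of the last two points, here is a sketch of a min-max normalization and a log transform on the built-in mtcars data; rescale01 is a helper defined here for the example, not a tidyverse function:

```r
library(dplyr)

# Min-max normalization: rescale a numeric vector to the [0, 1] range
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

mtcars %>%
  transmute(hp,
            hp_scaled = rescale01(hp), # values now between 0 and 1
            hp_log    = log(hp)) %>%   # compresses large values
  head()
```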
These processes, often performed using R packages like dplyr, tidyr, and data.table, are crucial steps in converting raw, disorganized data into meaningful insights. Together, they form the foundation for effective data analytics, which uses cleaned and well-structured data to perform descriptive, predictive, and inferential analyses.
Practice with Cardio Data
The data we will be working with is the cardio_train.csv dataset, which contains information on patients with cardiovascular disease. It can be found on Kaggle at the following link: Cardiovascular Disease Dataset.
Data Description
The dataset includes the following columns:
Column Name | Description | Variable Name | Data Type |
---|---|---|---|
Age | Objective Feature | age | int (days) |
Height | Objective Feature | height | int (cm) |
Weight | Objective Feature | weight | float (kg) |
Gender | Objective Feature | gender | categorical (1: women, 2: men) |
Systolic blood pressure | Examination Feature | ap_hi | int |
Diastolic blood pressure | Examination Feature | ap_lo | int |
Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
Smoking | Subjective Feature | smoke | binary |
Alcohol intake | Subjective Feature | alco | binary |
Physical activity | Subjective Feature | active | binary |
Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
Libraries and Data
Install the tidyverse package if it is not already installed:
if (!require("tidyverse")) install.packages("tidyverse")
Load the tidyverse package:
library(tidyverse)
Import Data
Unzip the file in the data folder:
unzip("data/cardio_train.zip", exdir = "data")
We load the dataset using the read.csv function, which is part of base R and reads comma-separated values (CSV) files. It allows us to set the sep = ";" option, which is necessary for reading the cardio_train.csv file.
?utils::read.csv
cardio_train <- read.csv("data/cardio_train.csv", sep = ";")
Inspect Data
head(cardio_train)
id age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active
1 0 18393 2 168 62 110 80 1 1 0 0 1
2 1 20228 1 156 85 140 90 3 1 0 0 1
3 2 18857 1 165 64 130 70 3 1 0 0 0
4 3 17623 2 169 82 150 100 1 1 0 0 1
5 4 17474 1 156 56 100 60 1 1 0 0 0
6 8 21914 1 151 67 120 80 2 2 0 0 0
cardio
1 0
2 1
3 1
4 1
5 0
6 0
cardio_train %>% glimpse()
Rows: 70,000
Columns: 13
$ id <int> 0, 1, 2, 3, 4, 8, 9, 12, 13, 14, 15, 16, 18, 21, 23, 24, 2…
$ age <int> 18393, 20228, 18857, 17623, 17474, 21914, 22113, 22584, 17…
$ gender <int> 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2…
$ height <int> 168, 156, 165, 169, 156, 151, 157, 178, 158, 164, 169, 173…
$ weight <dbl> 62, 85, 64, 82, 56, 67, 93, 95, 71, 68, 80, 60, 60, 78, 95…
$ ap_hi <int> 110, 140, 130, 150, 100, 120, 130, 130, 110, 110, 120, 120…
$ ap_lo <int> 80, 90, 70, 100, 60, 80, 80, 90, 70, 60, 80, 80, 80, 70, 9…
$ cholesterol <int> 1, 3, 3, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ gluc <int> 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1…
$ smoke <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1…
$ alco <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ active <int> 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1…
$ cardio <int> 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
This dataset contains 70000 observations and 13 variables. The variables include age (in days), height (in cm), weight (in kg), and others that describe the patients’ health status and lifestyle factors.
Data Wrangling
Handling Missing Data
- is.na()
- na.omit()
- drop_na()
cardio_train %>% is.na() %>% colSums()
id age gender height weight ap_hi
0 0 0 0 0 0
ap_lo cholesterol gluc smoke alco active
0 0 0 0 0 0
cardio
0
There are no missing values in this dataset.
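Since there is nothing to impute here, here is a sketch of a simple mean imputation on an invented data frame, using tidyr's replace_na() (the values are made up):

```r
library(dplyr)
library(tidyr)

toy <- data.frame(weight = c(62, NA, 85))

toy %>%
  mutate(weight = replace_na(weight, mean(weight, na.rm = TRUE)))
# the NA is replaced by the mean of the observed weights (73.5)
```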
Dealing with Duplicates
- distinct()
cardio_train %>% dim()
[1] 70000 13
cardio_train %>% distinct() %>% nrow()
[1] 70000
Rename the original data cardio_train to cardio for easier reference, and as a best practice to keep the original data intact:
cardio <- cardio_train
Standardizing Data Formats
- mutate()
- as.numeric()
- as.character()
- as.factor()
Depending on the type of analysis we want to perform, we may need to convert the age variable from days to years. Let's create a new variable age_years that represents the age in years; we can use the round function to round it to the nearest whole number:
cardio %>%
  mutate(age_years = round(age / 365.25),
         .after = age) %>%
  head()
id age age_years gender height weight ap_hi ap_lo cholesterol gluc smoke
1 0 18393 50 2 168 62 110 80 1 1 0
2 1 20228 55 1 156 85 140 90 3 1 0
3 2 18857 52 1 165 64 130 70 3 1 0
4 3 17623 48 2 169 82 150 100 1 1 0
5 4 17474 48 1 156 56 100 60 1 1 0
6 8 21914 60 1 151 67 120 80 2 2 0
alco active cardio
1 0 1 0
2 0 1 1
3 0 0 1
4 0 1 1
5 0 0 0
6 0 0 0
Data Manipulation
Subsetting Data
- select()
- filter()
Let’s select a subset of columns from the dataset:
cardio %>%
  mutate(age_years = round(age / 365.25)) %>%
  select(age, age_years) %>%
  head()
age age_years
1 18393 50
2 20228 55
3 18857 52
4 17623 48
5 17474 48
6 21914 60
Let’s filter a subset of rows from the dataset based on a condition:
cardio %>%
  mutate(age_years = round(age / 365.25)) %>%
  select(age, age_years) %>%
  filter(age_years > 50) %>%
  head()
age age_years
1 20228 55
2 18857 52
3 21914 60
4 22113 61
5 22584 62
6 19834 54
Grouping and Summarizing Data
- group_by()
- summarize()
cardio %>%
  mutate(age_years = round(age / 365.25)) %>%
  group_by(cholesterol) %>%
  summarize(mean_age = round(mean(age_years, na.rm = TRUE)),
            mean_weight = mean(weight, na.rm = TRUE),
            median_height = median(height, na.rm = TRUE)) %>%
  head()
# A tibble: 3 × 4
cholesterol mean_age mean_weight median_height
<int> <dbl> <dbl> <dbl>
1 1 53 73.1 165
2 2 54 76.7 164
3 3 56 78.8 163
Creating New Variables
- mutate()
Let’s create a new variable bmi (Body Mass Index) by calculating the weight in kilograms divided by the square of the height in meters:
cardio %>%
  mutate(age_years = round(age / 365.25),
         bmi = weight / ((height / 100) ^ 2)) %>%
  select(cardio, age_years, weight, height, bmi, cholesterol) %>%
  head()
cardio age_years weight height bmi cholesterol
1 0 50 62 168 21.96712 1
2 1 55 85 156 34.92768 3
3 1 52 64 165 23.50781 3
4 1 48 82 169 28.71048 1
5 0 48 56 156 23.01118 1
6 0 60 67 151 29.38468 2
Let’s filter the dataset to include only patients with a BMI greater than 30 (obese patients):
cardio %>%
  mutate(age_years = round(age / 365.25),
         bmi = weight / ((height / 100) ^ 2)) %>%
  select(cardio, age_years, weight, height, bmi, cholesterol) %>%
  filter(bmi > 30) %>%
  head()
cardio age_years weight height bmi cholesterol
1 1 55 85 156 34.92768 3
2 0 61 93 157 37.72973 3
3 0 54 78 158 31.24499 1
4 1 46 112 172 37.85830 1
5 0 54 83 163 31.23941 1
6 1 63 90 158 36.05191 2
cardio %>%
  mutate(gender = factor(gender,
                         levels = c(1, 2),
                         labels = c("female", "male")),
         bmi = weight / ((height / 100) ^ 2)) %>%
  ggplot(aes(x = bmi, fill = gender)) +
  geom_histogram(binwidth = 2) +
  xlim(0, 100) +
  labs(title = "BMI Distribution by Gender",
       x = "BMI",
       y = "Count")
Sorting Data
- arrange()
Let’s sort the dataset by age in descending order:
cardio %>%
  mutate(age_years = round(age / 365.25)) %>%
  arrange(desc(age)) %>%
  select(age, age_years) %>%
  head()
age age_years
1 23713 65
2 23701 65
3 23692 65
4 23690 65
5 23687 65
6 23684 65
More Grouping
- group_by()
- summarize()
Let’s group the data by cardio and age in years, and calculate the average BMI for each age group:
cardio %>%
  mutate(age_years = round(age / 365.25),
         bmi = weight / ((height / 100) ^ 2)) %>%
  group_by(cardio, age_years) %>%
  summarize(avg_bmi = mean(bmi, na.rm = TRUE)) %>%
  head()
# A tibble: 6 × 3
# Groups: cardio [1]
cardio age_years avg_bmi
<int> <dbl> <dbl>
1 0 30 23.9
2 0 39 26.3
3 0 40 25.7
4 0 41 25.9
5 0 42 25.7
6 0 43 26.0
cardio %>%
  mutate(age_years = round(age / 365.25),
         bmi = weight / ((height / 100) ^ 2),
         # cardio is coded 0 = absence, 1 = presence of disease
         cardio = factor(cardio, labels = c("NO CVD", "CVD"))) %>%
  group_by(cardio, age_years) %>%
  summarize(avg_bmi = mean(bmi, na.rm = TRUE)) %>%
  ggplot(aes(x = age_years, y = avg_bmi)) +
  geom_point() +
  geom_smooth(linewidth = 0.5,
              linetype = "dashed",
              se = FALSE) +
  facet_wrap(~ cardio, scales = "free")
Summary Statistics
- summary()
Let’s calculate the summary statistics for all variables:
cardio %>%
  mutate(age_years = round(age / 365.25),
         bmi = weight / ((height / 100) ^ 2),
         .after = age) %>%
  map_df(~summary(.), .id = "var") %>%
  column_to_rownames(var = "var") %>%
  janitor::clean_names() %>%
  mutate(mean = round(mean)) %>%
  relocate(mean, median, min, max)
mean median min max x1st_qu x3rd_qu
id 49972 50001.50000 0.000000 99999.0000 25006.75000 74889.25000
age 19469 19703.00000 10798.000000 23713.0000 17664.00000 21327.00000
age_years 53 54.00000 30.000000 65.0000 48.00000 58.00000
bmi 28 26.37407 3.471784 298.6667 23.87511 30.22222
gender 1 1.00000 1.000000 2.0000 1.00000 2.00000
height 164 165.00000 55.000000 250.0000 159.00000 170.00000
weight 74 72.00000 10.000000 200.0000 65.00000 82.00000
ap_hi 129 120.00000 -150.000000 16020.0000 120.00000 140.00000
ap_lo 97 80.00000 -70.000000 11000.0000 80.00000 90.00000
cholesterol 1 1.00000 1.000000 3.0000 1.00000 2.00000
gluc 1 1.00000 1.000000 3.0000 1.00000 1.00000
smoke 0 0.00000 0.000000 1.0000 0.00000 0.00000
alco 0 0.00000 0.000000 1.0000 0.00000 0.00000
active 1 1.00000 0.000000 1.0000 1.00000 1.00000
cardio 0 0.00000 0.000000 1.0000 0.00000 1.00000
Let’s calculate the summary statistics for the numeric variables only:
cardio %>%
  mutate(id = NULL,  # drop the id column
         gender = as.factor(gender),
         smoke = as.factor(smoke),
         active = as.factor(active),
         cardio = as.factor(cardio)) %>%
  # summarize stats only for numeric variables
  summarize(across(where(is.numeric),
                   list(mean = mean,
                        median = median,
                        sd = sd))) %>%
  t()
[,1]
age_mean 1.946887e+04
age_median 1.970300e+04
age_sd 2.467252e+03
height_mean 1.643592e+02
height_median 1.650000e+02
height_sd 8.210126e+00
weight_mean 7.420569e+01
weight_median 7.200000e+01
weight_sd 1.439576e+01
ap_hi_mean 1.288173e+02
ap_hi_median 1.200000e+02
ap_hi_sd 1.540114e+02
ap_lo_mean 9.663041e+01
ap_lo_median 8.000000e+01
ap_lo_sd 1.884725e+02
cholesterol_mean 1.366871e+00
cholesterol_median 1.000000e+00
cholesterol_sd 6.802503e-01
gluc_mean 1.226457e+00
gluc_median 1.000000e+00
gluc_sd 5.722703e-01
alco_mean 5.377143e-02
alco_median 0.000000e+00
alco_sd 2.255677e-01
Data Transformation
Combining Datasets
- left_join()
- inner_join()
- right_join()
- full_join()
- bind_rows() / bind_cols()
- merge()
In some cases, we may need to combine multiple datasets to perform more complex analyses. We can use one of the join functions from the dplyr package to join two datasets based on a common key.
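To see how the join types differ, here is a small sketch on two invented tables sharing the key id (names and values are hypothetical):

```r
library(dplyr)

patients <- tibble(id = c(1, 2, 3), age = c(50, 55, 60))
visits   <- tibble(id = c(2, 3, 4), n_visits = c(1, 4, 2))

left_join(patients, visits, by = "id")   # all patients; id 1 gets NA
inner_join(patients, visits, by = "id")  # only ids 2 and 3
full_join(patients, visits, by = "id")   # ids 1 through 4
```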
cardio %>%
  mutate(age_years = round(age / 365.25)) %>%
  pull(age_years) %>%
  summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
30.0 48.0 54.0 53.3 58.0 65.0
Let’s group age by intervals and count the number of observations in each group:
cardio %>%
  mutate(id = NULL,  # drop the id column
         age_years = round(age / 365.25),
         # group age_years in intervals and assign it to a new variable
         age_group = cut_interval(age_years,
                                  n = 4,
                                  labels = c("5-14", "15-49", "50-69", "70+")),
         .after = age) %>%
  arrange(age_years) %>%
  count(age_group)
age_group n
1 5-14 4
2 15-49 14569
3 50-69 31137
4 70+ 24290
Let’s create a new dataset cardio2 by adding a new variable bmi and applying other transformations to the original dataset cardio:
cardio2 <- cardio %>%
  mutate(id = NULL,  # drop the id column
         age_years = round(age / 365.25),
         age_group = cut_interval(age_years,
                                  n = 4,
                                  labels = c("5-14", "15-49",
                                             "50-69", "70+")),
         bmi = weight / ((height / 100) ^ 2),
         cardio = factor(cardio, levels = c(0, 1)),
         gender = factor(gender,
                         levels = c(1, 2),
                         labels = c("female", "male")),
         smoke = factor(smoke, levels = c(0, 1))) %>%
  select(cardio, age_group, gender,
         weight, height, smoke,
         cholesterol, gluc, bmi) %>%
  filter(bmi > 30) %>%
  distinct()
And now create a cardio3 dataset with the mean values for bmi, weight, height, and the count of observations for each group:
cardio3 <- cardio2 %>%
  group_by(cardio, age_group, gender, smoke, cholesterol, gluc) %>%
  reframe(mean_bmi = mean(bmi, na.rm = TRUE),
          mean_weight = mean(weight, na.rm = TRUE),
          mean_height = mean(height, na.rm = TRUE),
          count = n())
cardio3 %>% head()
# A tibble: 6 × 10
cardio age_group gender smoke cholesterol gluc mean_bmi mean_weight
<fct> <fct> <fct> <fct> <int> <int> <dbl> <dbl>
1 0 5-14 male 0 1 1 30.0 92
2 0 15-49 female 0 1 1 36.0 90.3
3 0 15-49 female 0 1 2 34.6 87.4
4 0 15-49 female 0 1 3 35.6 90.6
5 0 15-49 female 0 2 1 35.3 89.2
6 0 15-49 female 0 2 2 34.6 90.6
# ℹ 2 more variables: mean_height <dbl>, count <int>
Let’s join cardio3 with a new dataset cardio_deaths containing death rates due to cardiovascular diseases by location and year; the data is from the IHME website:
cardio_deaths <- read_csv("data/cardio_deaths.csv")
cardio_deaths %>% head()
# A tibble: 6 × 5
location year age gender value
<chr> <dbl> <chr> <chr> <dbl>
1 Low SDI 1990 15-49 female 34.7
2 Low SDI 1991 15-49 female 34.4
3 Low SDI 1992 15-49 female 34.1
4 Low SDI 1993 15-49 female 33.8
5 Low SDI 1994 15-49 female 33.7
6 Low SDI 1995 15-49 female 33.4
Let’s check the dimensions of the cardio3 and cardio_deaths datasets:
cardio3 %>% dim()
[1] 200 10
cardio_deaths %>% dim()
[1] 1280 5
Let’s join the cardio3 and cardio_deaths datasets using the right_join function from the dplyr package:
cardio_join <- cardio_deaths %>%
  group_by(age, gender) %>%
  reframe(avg_deaths = mean(value, na.rm = TRUE)) %>%
  right_join(cardio3,
             by = c("age" = "age_group", "gender"))
Let’s check the dimensions of the joined dataset and the number of distinct rows:
cardio_join %>% dim()
[1] 200 11
cardio_join %>% distinct() %>% dim()
[1] 200 11
Let’s check the first few rows of the joined dataset:
head(cardio_join)
# A tibble: 6 × 11
age gender avg_deaths cardio smoke cholesterol gluc mean_bmi mean_weight
<chr> <chr> <dbl> <fct> <fct> <int> <int> <dbl> <dbl>
1 15-49 female 24.8 0 0 1 1 36.0 90.3
2 15-49 female 24.8 0 0 1 2 34.6 87.4
3 15-49 female 24.8 0 0 1 3 35.6 90.6
4 15-49 female 24.8 0 0 2 1 35.3 89.2
5 15-49 female 24.8 0 0 2 2 34.6 90.6
6 15-49 female 24.8 0 0 2 3 31.2 72
# ℹ 2 more variables: mean_height <dbl>, count <int>
Tidying Data
- pivot_longer()
- pivot_wider()
- gather() - the same as pivot_longer()
- spread() - the same as pivot_wider()
- separate()
- unite()
Let’s pivot the cardio_join dataset from wide to long format, gathering the cholesterol and gluc variables into a single exam_type column with their values in a single result column:
cardio_join_long <- cardio_join %>%
  pivot_longer(cols = c(cholesterol, gluc),
               names_to = "exam_type",
               values_to = "result") %>%
  select(cardio, age, exam_type, result)
cardio_join_long %>%
  head()
# A tibble: 6 × 4
cardio age exam_type result
<fct> <chr> <chr> <int>
1 0 15-49 cholesterol 1
2 0 15-49 gluc 1
3 0 15-49 cholesterol 1
4 0 15-49 gluc 2
5 0 15-49 cholesterol 1
6 0 15-49 gluc 3
Let’s check the dimensions of the cardio_join_long dataset and the number of distinct rows:
cardio_join_long %>% dim()
[1] 400 4
cardio_join_long %>% distinct() %>% dim()
[1] 38 4
Let’s remove duplicate rows from the cardio_join_long dataset:
cardio_join_long <- cardio_join_long %>%
  distinct()
Let’s pivot the cardio_join_long dataset back to the original format:
cardio_join_long %>%
  group_by(cardio, age) %>%
  pivot_wider(names_from = exam_type,
              values_from = result) %>%
  # unnest the list-columns created by duplicate combinations
  unnest(cols = c(cholesterol, gluc)) %>%
  head()
# A tibble: 6 × 4
# Groups: cardio, age [2]
cardio age cholesterol gluc
<fct> <chr> <int> <int>
1 0 15-49 1 1
2 0 15-49 2 2
3 0 15-49 3 3
4 1 15-49 1 1
5 1 15-49 2 2
6 1 15-49 3 3
And, let’s separate the age variable into two columns, age_group_start and age_group_end:
cardio_join_age_group_sep <- cardio_join %>%
  separate(col = "age",
           into = c("age_group_start", "age_group_end"),
           remove = FALSE) %>%
  select(cardio, age_group_start, age_group_end)
cardio_join_age_group_sep %>%
  head()
# A tibble: 6 × 3
cardio age_group_start age_group_end
<fct> <chr> <chr>
1 0 15 49
2 0 15 49
3 0 15 49
4 0 15 49
5 0 15 49
6 0 15 49
Then, let’s unite the age_group_start and age_group_end columns back into a single column age:
cardio_join_age_group_sep %>%
  unite(col = "age",
        age_group_start, age_group_end,
        sep = "-") %>%
  select(cardio, age) %>%
  head()
# A tibble: 6 × 2
cardio age
<fct> <chr>
1 0 15-49
2 0 15-49
3 0 15-49
4 0 15-49
5 0 15-49
6 0 15-49
Exporting Data
- write.csv()
Once we have cleaned and transformed the data, we may want to export it to a file for further analysis or sharing. We can use the write.csv function to export the data to a CSV file. Let’s export the cardio_join dataset to a CSV file named cardio_cleaned.csv:
write.csv(cardio_join, "data/cardio_cleaned.csv", row.names = FALSE)
Summary
Data wrangling, manipulation, and transformation are essential techniques for preparing and refining data for analysis. By cleaning, transforming, and organizing raw data, we can ensure its accuracy and consistency, making it suitable for analysis and visualization. These processes form the foundation for effective data analytics, enabling us to derive meaningful insights and make informed decisions based on the data.
In this tutorial, we explored various data wrangling techniques using the cardio_train.csv dataset, including checking for missing values, transforming variables, filtering and sorting data, grouping data, and calculating summary statistics. By applying these techniques, we gained valuable insights into the dataset and learned how to manipulate and analyze data effectively using R.
Data wrangling is a critical skill for data scientists and analysts, as it allows them to work with raw data and prepare it for analysis. By mastering these techniques, you can ensure that your data is clean, well-structured, and ready for further analysis, enabling you to derive meaningful insights and make informed decisions based on the data.
References
- Cardiovascular Disease Dataset on Kaggle
- IHME website
- R for Data Science by Wickham, H., & Grolemund, G. (2017). O’Reilly Media.
- Tidyverse Learn Documentation
- Tidy Data by Wickham, H. (2014). Journal of Statistical Software, 59(10), 1-23