Data Wrangling Practice with R

R-Ladies Rome Tutorials

Author

Federica Gazzelloni

Published

September 30, 2024

Introduction

Data preparation is a foundational step in the data science process. It refers to the set of procedures used to clean, organize, and structure data, ensuring it is ready for analysis. Without proper data preparation, even the most sophisticated analytical models can produce unreliable results due to inconsistencies or noise in the data. In practice, data preparation encompasses data wrangling, manipulation, and transformation techniques, which are essential for obtaining high-quality, actionable insights.

Essential Data Preparation Techniques: Wrangling, Manipulation, and Transformation

Data wrangling (also known as data munging) refers to the process of cleaning and organizing raw data to make it suitable for analysis. Data Wrangling also include data manipulation and transformation techniques that help to refine and structure the data for further analysis. These techniques are essential for ensuring data quality and consistency, enabling data scientists and analysts to derive meaningful insights from the data.

Data wrangling involves cleaning raw data handling missing values, correcting errors, and merging datasets—to ensure accuracy and consistency. It also includes structuring data into a tidy format, where each variable is a column, each observation is a row, and each type of observational unit is a table. This format makes it easier to analyze and visualize data effectively.

Handling Missing Data: Missing values (NA) are common in datasets. R provides several ways to address them, such as using is.na() to detect them, and na.omit() to remove them, or applying imputation methods to fill in missing values.
Dealing with Duplicates: Identifying and removing duplicate rows is essential to avoid skewed results. The distinct() function in R’s dplyr package makes this task simple.
Standardizing Data Formats: Ensuring consistent data types is crucial, especially for numeric, character, and factor variables. Functions like mutate() and as.numeric(), as.character(), or as.factor() help transform data into the appropriate format.

Data manipulation builds on this by adjusting, filtering, and transforming specific aspects of the data, such as selecting relevant columns, filtering rows based on conditions, and creating new variables.

Filtering Data: Filtering allows you to extract specific rows based on conditions. The filter() function in dplyr is useful for this task.
Sorting Data: Sorting data helps organize it in a meaningful way. The arrange() function in dplyr can sort data by one or more columns.
Grouping Data: Grouping data allows you to perform calculations on subsets of the data. The group_by() function in dplyr is used to group data by one or more variables.
Summarizing Data: Summarizing data provides insights into the dataset’s characteristics. The summarize() function in dplyr can calculate summary statistics like mean, median, and standard deviation.

Data transformation goes deeper by reshaping the structure of the data (e.g., converting between wide and long formats), normalizing values, or applying mathematical operations like log transformations to ensure that the data is suitable for complex analysis or modeling.

Reshaping Data: Reshaping data involves converting between wide and long formats using functions like gather() and spread() in tidyr.
Normalizing Data: Normalizing data involves scaling values to a standard range, such as between 0 and 1, to ensure consistency across variables.
Applying Mathematical Operations: Mathematical operations like log transformations, standardization, or normalization can help prepare data for advanced analyses like machine learning.

These processes, often performed using R packages like dplyr, tidyr, and data.table, are crucial steps in converting raw, disorganized data into meaningful insights. Together, they form the foundation for effective data analytics, which uses cleaned and well-structured data to perform descriptive, predictive, and inferential analyses.

Practice with Cardio Data

The data we will be working with is the cardio_train.csv dataset, which contains information on patients with cardiovascular disease. It can be found on Kaggle at the following link: Cardiovascular Disease Dataset.

Data Description

The dataset includes the following columns:

Column Name	Description	Variable Name	Data Type
Age	Objective Feature	age	int (days)
Height	Objective Feature	height	int (cm)
Weight	Objective Feature	weight	float (kg)
Gender	Objective Feature	gender	categorical (1 w, 2 m)
Systolic blood pressure	Examination Feature	ap_hi	int
Diastolic blood pressure	Examination Feature	ap_lo	int
Cholesterol	Examination Feature	cholesterol	1: normal, 2: above normal, 3: well above normal
Glucose	Examination Feature	gluc	1: normal, 2: above normal, 3: well above normal
Smoking	Subjective Feature	smoke	binary
Alcohol intake	Subjective Feature	alco	binary
Physical activity	Subjective Feature	active	binary
Presence or absence of cardiovascular disease	Target Variable	cardio	binary

Libraries and Data

Install the package if is not already installed:

if (!require("tidyverse")) install.packages("tidyverse")

Load the tidyverse package:

library(tidyverse)

Import Data

Unzip the file in the data folder:

unzip("data/cardio_train.zip", exdir = "data")

We load the dataset using the read.csv function part of the base R package and used to read comma-separated values (CSV) files. It allows us to use the sep = ";" option, which is necessary for reading the cardio_train.csv file.

?utils::read.csv

cardio_train <- read.csv("data/cardio_train.csv", sep=";")

Inspect Data

head(cardio_train)

  id   age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active
1  0 18393      2    168     62   110    80           1    1     0    0      1
2  1 20228      1    156     85   140    90           3    1     0    0      1
3  2 18857      1    165     64   130    70           3    1     0    0      0
4  3 17623      2    169     82   150   100           1    1     0    0      1
5  4 17474      1    156     56   100    60           1    1     0    0      0
6  8 21914      1    151     67   120    80           2    2     0    0      0
  cardio
1      0
2      1
3      1
4      1
5      0
6      0

cardio_train %>% glimpse()

Rows: 70,000
Columns: 13
$ id          <int> 0, 1, 2, 3, 4, 8, 9, 12, 13, 14, 15, 16, 18, 21, 23, 24, 2…
$ age         <int> 18393, 20228, 18857, 17623, 17474, 21914, 22113, 22584, 17…
$ gender      <int> 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2…
$ height      <int> 168, 156, 165, 169, 156, 151, 157, 178, 158, 164, 169, 173…
$ weight      <dbl> 62, 85, 64, 82, 56, 67, 93, 95, 71, 68, 80, 60, 60, 78, 95…
$ ap_hi       <int> 110, 140, 130, 150, 100, 120, 130, 130, 110, 110, 120, 120…
$ ap_lo       <int> 80, 90, 70, 100, 60, 80, 80, 90, 70, 60, 80, 80, 80, 70, 9…
$ cholesterol <int> 1, 3, 3, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ gluc        <int> 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1…
$ smoke       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1…
$ alco        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ active      <int> 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1…
$ cardio      <int> 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…

This dataset contains 70000 observations and 13 variables. The variables include age (in days), height (in cm), weight (in kg), and others that describe the patients’ health status and lifestyle factors.

Data Wrangling

Handling Missing Data

is.na()
na.omit()
drop_na()

cardio_train %>% is.na() %>% colSums()

         id         age      gender      height      weight       ap_hi 
          0           0           0           0           0           0 
      ap_lo cholesterol        gluc       smoke        alco      active 
          0           0           0           0           0           0 
     cardio 
          0

There are no missing values in this dataset.

Dealing with Duplicates

distinct()

cardio_train %>% dim()

[1] 70000    13

cardio_train %>% distinct() %>% nrow()

[1] 70000

Rename original data cardio_train to cardio for easier reference, and as a best practice to keep the original data intact:

cardio <- cardio_train

Standardizing Data Formats

mutate()
as.numeric()
as.character()
as.factor()

Depending on the type of analysis we want to perform, we may need to convert the age variable from days to years. Let’s see how we can create a new variable age_years that represents the age in years, we can use the round function to round the age in years to the nearest whole number:

cardio %>% 
  mutate(age_years = round(age / 365.25), 
         .after = age) %>%
  head()

  id   age age_years gender height weight ap_hi ap_lo cholesterol gluc smoke
1  0 18393        50      2    168     62   110    80           1    1     0
2  1 20228        55      1    156     85   140    90           3    1     0
3  2 18857        52      1    165     64   130    70           3    1     0
4  3 17623        48      2    169     82   150   100           1    1     0
5  4 17474        48      1    156     56   100    60           1    1     0
6  8 21914        60      1    151     67   120    80           2    2     0
  alco active cardio
1    0      1      0
2    0      1      1
3    0      0      1
4    0      1      1
5    0      0      0
6    0      0      0

Data Manipulation

Subsetting Data

select()
filter()

Let’s select a subset of columns from the dataset:

cardio %>% 
  mutate(age_years = round(age / 365.25)) %>%
  select(age, age_years) %>%
  head()

    age age_years
1 18393        50
2 20228        55
3 18857        52
4 17623        48
5 17474        48
6 21914        60

Let’s filter a subset of columns from the dataset:

cardio %>% 
  mutate(age_years = round(age / 365.25)) %>%
  select(age, age_years) %>%
  filter(age_years > 50) %>%
  head()

    age age_years
1 20228        55
2 18857        52
3 21914        60
4 22113        61
5 22584        62
6 19834        54

Grouping and Summarizing Data

group_by()
summarize()

cardio %>% 
  mutate(age_years = round(age / 365.25)) %>%
  group_by(cholesterol) %>%
  summarize(mean_age = round(mean(age_years, na.rm = TRUE)),
            mean_weight = mean(weight, na.rm = TRUE),
            median_height = median(height, na.rm = TRUE)) %>%
  head()

# A tibble: 3 × 4
  cholesterol mean_age mean_weight median_height
        <int>    <dbl>       <dbl>         <dbl>
1           1       53        73.1           165
2           2       54        76.7           164
3           3       56        78.8           163

Creating New Variables

mutate()

Let’s create a new variable bmi (Body Mass Index) by calculating the weight in kilograms divided by the square of the height in meters:

cardio %>% 
  mutate(age_years = round(age / 365.25),
         bmi = weight / ((height / 100) ^ 2)) %>%
  select(cardio, age_years, weight, height, bmi, cholesterol) %>%
  head()

  cardio age_years weight height      bmi cholesterol
1      0        50     62    168 21.96712           1
2      1        55     85    156 34.92768           3
3      1        52     64    165 23.50781           3
4      1        48     82    169 28.71048           1
5      0        48     56    156 23.01118           1
6      0        60     67    151 29.38468           2

Let’s filter the dataset to include only patients with a BMI greater than 30 (obese patients):

cardio %>% 
  mutate(age_years = round(age / 365.25),
         bmi = weight / ((height / 100) ^ 2)) %>%
  select(cardio, age_years, weight, height, bmi, cholesterol)  %>%
  filter(bmi > 30) %>%
  head()

  cardio age_years weight height      bmi cholesterol
1      1        55     85    156 34.92768           3
2      0        61     93    157 37.72973           3
3      0        54     78    158 31.24499           1
4      1        46    112    172 37.85830           1
5      0        54     83    163 31.23941           1
6      1        63     90    158 36.05191           2

cardio %>% 
  mutate(gender = factor(gender, 
                         levels = c('female' = 1, 'male' = 2), 
                         labels = c('female', 'male')),
         bmi = weight / ((height / 100) ^ 2)) %>%
  ggplot(aes(x = bmi, fill = gender)) + 
  geom_histogram(binwidth = 2) +
  xlim(0, 100) +
  labs(title = "BMI Distribution by Gender",
       x = "BMI",
       y = "Count")

Sorting Data

arrange()

Let’s sort the dataset by age in descending order:

cardio %>% 
  mutate(age_years = round(age / 365.25)) %>%
  arrange(desc(age)) %>%
  select(age, age_years) %>%
  head()

    age age_years
1 23713        65
2 23701        65
3 23692        65
4 23690        65
5 23687        65
6 23684        65

More Grouping

group_by()
summarize()

Let’s group the data by cardio, and age in years and calculate the average BMI for each age group:

cardio %>% 
  mutate(age_years = round(age / 365.25),
         bmi = weight / ((height / 100) ^ 2)) %>%
  group_by(cardio, age_years) %>%
  summarize(avg_bmi = mean(bmi, na.rm = TRUE)) %>%
  head()

# A tibble: 6 × 3
# Groups:   cardio [1]
  cardio age_years avg_bmi
   <int>     <dbl>   <dbl>
1      0        30    23.9
2      0        39    26.3
3      0        40    25.7
4      0        41    25.9
5      0        42    25.7
6      0        43    26.0

cardio %>% 
  mutate(age_years = round(age / 365.25),
         bmi = weight / ((height / 100) ^ 2),
         cardio = factor(cardio, labels = c("CVD", "NO CVD"))) %>%
  group_by(cardio, age_years) %>%
  summarize(avg_bmi = mean(bmi, na.rm = TRUE)) %>%
  ggplot(aes(x = age_years, y = avg_bmi)) +
  geom_point() +
  geom_smooth(linewidth = 0.5, 
              linetype = "dashed",
              se = F) +
  facet_wrap( ~ cardio, scales = "free")

Summary Statistics

summary()

Let’s calculate the summary statistics for the all variables:

cardio %>% 
  mutate(age_years = round(age / 365.25),
         bmi = weight / ((height / 100) ^ 2),
         .after = age) %>%
  map_df(~summary(.), .id = "var") %>%
  column_to_rownames(var = "var") %>%
  janitor::clean_names() %>%
  mutate(mean = round(mean)) %>%
  relocate(mean, median, min, max)

             mean      median          min        max     x1st_qu     x3rd_qu
id          49972 50001.50000     0.000000 99999.0000 25006.75000 74889.25000
age         19469 19703.00000 10798.000000 23713.0000 17664.00000 21327.00000
age_years      53    54.00000    30.000000    65.0000    48.00000    58.00000
bmi            28    26.37407     3.471784   298.6667    23.87511    30.22222
gender          1     1.00000     1.000000     2.0000     1.00000     2.00000
height        164   165.00000    55.000000   250.0000   159.00000   170.00000
weight         74    72.00000    10.000000   200.0000    65.00000    82.00000
ap_hi         129   120.00000  -150.000000 16020.0000   120.00000   140.00000
ap_lo          97    80.00000   -70.000000 11000.0000    80.00000    90.00000
cholesterol     1     1.00000     1.000000     3.0000     1.00000     2.00000
gluc            1     1.00000     1.000000     3.0000     1.00000     1.00000
smoke           0     0.00000     0.000000     1.0000     0.00000     0.00000
alco            0     0.00000     0.000000     1.0000     0.00000     0.00000
active          1     1.00000     0.000000     1.0000     1.00000     1.00000
cardio          0     0.00000     0.000000     1.0000     0.00000     1.00000

Let’s calculate the summary statistics for the numeric variables only:

cardio %>%
  mutate(id = row.names("id"),
         gender = as.factor(gender),
         smoke = as.factor(smoke),
         active = as.factor(active),
         cardio = as.factor(cardio)) %>%
  # summarize stats only for numeric variables
  summarize(across(where(is.numeric), 
                   list(mean = mean, 
                        median = median, 
                        sd = sd))) %>%
  t()

                           [,1]
age_mean           1.946887e+04
age_median         1.970300e+04
age_sd             2.467252e+03
height_mean        1.643592e+02
height_median      1.650000e+02
height_sd          8.210126e+00
weight_mean        7.420569e+01
weight_median      7.200000e+01
weight_sd          1.439576e+01
ap_hi_mean         1.288173e+02
ap_hi_median       1.200000e+02
ap_hi_sd           1.540114e+02
ap_lo_mean         9.663041e+01
ap_lo_median       8.000000e+01
ap_lo_sd           1.884725e+02
cholesterol_mean   1.366871e+00
cholesterol_median 1.000000e+00
cholesterol_sd     6.802503e-01
gluc_mean          1.226457e+00
gluc_median        1.000000e+00
gluc_sd            5.722703e-01
alco_mean          5.377143e-02
alco_median        0.000000e+00
alco_sd            2.255677e-01

Data Transformation

Combining Datasets

left_join()
inner_join()
right_join()
full_join()
bind_rows()/bind_cols()
merge()

In some cases, we may need to combine multiple datasets to perform more complex analyses. We can use one of the join functions from the dplyr package to join two datasets based on a common key.

cardio %>% 
  mutate(age_years = round(age / 365.25)) %>%
  pull(age_years) %>%
  summary()

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   30.0    48.0    54.0    53.3    58.0    65.0

Let’s group age by intervals and count the number of observations in each group:

cardio %>% 
  mutate(id = row.names("id"),
         age_years = round(age / 365.25),
         # group age_years in intervals and assign it to a new variable
         age_group = cut_interval(age_years, 
                                  n = 4, 
                                  labels = c("5-14", "15-49", "50-69", "70+")),
         .after = age) %>%
  arrange(age_years) %>%
  count(age_group)

  age_group     n
1      5-14     4
2     15-49 14569
3     50-69 31137
4       70+ 24290

Let’s create a new dataset cardio2 by adding a new variable bmi, other transformations to the original dataset cardio:

cardio2 <- cardio %>% 
  mutate(id = row.names("id"),
         age_years = round(age / 365.25),
         age_group = cut_interval(age_years, 
                                  n = 4, 
                                  labels = c("5-14", "15-49", 
                                             "50-69","70+")),
         bmi = weight / ((height / 100) ^ 2),
         cardio = factor(cardio, levels = c(0,1)),
         gender = factor(gender, 
                         levels = c(1,2),
                         labels = c("female", "male")),
         smoke = factor(smoke, levels = c(0,1))) %>%
  select(cardio, age_group, gender, 
         weight, height, smoke, 
         cholesterol, gluc, bmi)  %>%
  filter(bmi > 30) %>% 
  distinct()

And now create cardio3 dataset with the mean values for bmi, weight, height, and the count of observations for each group:

cardio3 <- cardio2 %>%
  group_by(cardio, age_group, gender, smoke, cholesterol, gluc) %>%
  reframe(mean_bmi = mean(bmi, na.rm = TRUE),
          mean_weight = mean(weight, na.rm = TRUE),
          mean_height = mean(height, na.rm = TRUE),
          count = n())

cardio3 %>% head()

# A tibble: 6 × 10
  cardio age_group gender smoke cholesterol  gluc mean_bmi mean_weight
  <fct>  <fct>     <fct>  <fct>       <int> <int>    <dbl>       <dbl>
1 0      5-14      male   0               1     1     30.0        92  
2 0      15-49     female 0               1     1     36.0        90.3
3 0      15-49     female 0               1     2     34.6        87.4
4 0      15-49     female 0               1     3     35.6        90.6
5 0      15-49     female 0               2     1     35.3        89.2
6 0      15-49     female 0               2     2     34.6        90.6
# ℹ 2 more variables: mean_height <dbl>, count <int>

Let’s join the cardio3 with a new dataset cardio_deaths with the deaths rates due to cardiovascular diseases by country and year, data is from the IHME website:

cardio_deaths <- read_csv("data/cardio_deaths.csv")
cardio_deaths %>% head()

# A tibble: 6 × 5
  location  year age   gender value
  <chr>    <dbl> <chr> <chr>  <dbl>
1 Low SDI   1990 15-49 female  34.7
2 Low SDI   1991 15-49 female  34.4
3 Low SDI   1992 15-49 female  34.1
4 Low SDI   1993 15-49 female  33.8
5 Low SDI   1994 15-49 female  33.7
6 Low SDI   1995 15-49 female  33.4

Let’s check the dimensions of the cardio3 and cardio_deaths datasets:

cardio3 %>% dim();

[1] 200  10

cardio_deaths %>% dim()

[1] 1280    5

Let’s join the cardio3 and cardio_deaths datasets using the right_join function from the dplyr package:

cardio_join <- cardio_deaths %>% 
  group_by(age, gender) %>%
  reframe(avg_deaths = mean(value, na.rm = TRUE)) %>%
  right_join(cardio3, 
            by = c("age" = "age_group", "gender"))

Let’s check the dimensions of the joined dataset and the number of distinct rows:

cardio_join %>% dim;

[1] 200  11

cardio_join %>% distinct() %>% dim()

[1] 200  11

Let’s check the first few rows of the joined dataset:

head(cardio_join)

# A tibble: 6 × 11
  age   gender avg_deaths cardio smoke cholesterol  gluc mean_bmi mean_weight
  <chr> <chr>       <dbl> <fct>  <fct>       <int> <int>    <dbl>       <dbl>
1 15-49 female       24.8 0      0               1     1     36.0        90.3
2 15-49 female       24.8 0      0               1     2     34.6        87.4
3 15-49 female       24.8 0      0               1     3     35.6        90.6
4 15-49 female       24.8 0      0               2     1     35.3        89.2
5 15-49 female       24.8 0      0               2     2     34.6        90.6
6 15-49 female       24.8 0      0               2     3     31.2        72  
# ℹ 2 more variables: mean_height <dbl>, count <int>

Tidying Data

pivot_longer()
pivot_wider()
gather() - the same as pivot_longer()
spread() - the same as pivot_wider()
separate()
unite()

Let’s pivot the cardio_join dataset from wide to long format to separate the cholesterol and gluc variables into a single column exam_type and a single column result:

cardio_join_long <- cardio_join %>% 
  pivot_longer(cols = c(cholesterol, gluc),
               names_to = "exam_type",
               values_to = "result") %>%
  select(cardio, age, exam_type, result) 

cardio_join_long %>%
  head()

# A tibble: 6 × 4
  cardio age   exam_type   result
  <fct>  <chr> <chr>        <int>
1 0      15-49 cholesterol      1
2 0      15-49 gluc             1
3 0      15-49 cholesterol      1
4 0      15-49 gluc             2
5 0      15-49 cholesterol      1
6 0      15-49 gluc             3

Let’s check the dimensions of the cardio_join_long dataset and the number of distinct rows:

cardio_join_long %>% dim();

[1] 400   4

cardio_join_long %>% distinct() %>% dim()

[1] 38  4

Let’s remove duplicate rows from the cardio_join_long dataset:

cardio_join_long <- cardio_join_long %>% 
  distinct()

Let’s pivot the cardio_join_long dataset back to the original format:

cardio_join_long %>% 
  group_by(cardio, age) %>%
  pivot_wider(names_from = exam_type,
              values_from = result) %>%
  # unlist the nested columns
  unnest(cols = c(cholesterol, gluc)) %>%
  head()

# A tibble: 6 × 4
# Groups:   cardio, age [2]
  cardio age   cholesterol  gluc
  <fct>  <chr>       <int> <int>
1 0      15-49           1     1
2 0      15-49           2     2
3 0      15-49           3     3
4 1      15-49           1     1
5 1      15-49           2     2
6 1      15-49           3     3

And, let’s separate the age variable into two columns age_group_start and age_group_end:

cardio_join_age_group_sep <- cardio_join %>%
  separate(col = "age",
           into = c("age_group_start", "age_group_end"),
           remove = F) %>%
  select(cardio, age_group_start, age_group_end) 

cardio_join_age_group_sep %>%
  head()

# A tibble: 6 × 3
  cardio age_group_start age_group_end
  <fct>  <chr>           <chr>        
1 0      15              49           
2 0      15              49           
3 0      15              49           
4 0      15              49           
5 0      15              49           
6 0      15              49

Then, let’s unite the age_group_start and age_group_end columns back into a single column age:

cardio_join_age_group_sep %>%
  unite(col = "age",
        age_group_start, age_group_end, sep = "-") %>%
  select(cardio, age) %>%
  head()

# A tibble: 6 × 2
  cardio age  
  <fct>  <chr>
1 0      15-49
2 0      15-49
3 0      15-49
4 0      15-49
5 0      15-49
6 0      15-49

Exporting Data

write.csv()

Once we have cleaned and transformed the data, we may want to export it to a file for further analysis or sharing. We can use the write.csv function to export the data to a CSV file.

Let’s export the cardio_join dataset to a CSV file named cardio_cleaned.csv:

write.csv(cardio_join, "data/cardio_cleaned.csv", row.names = FALSE)

Summary

Data wrangling, manipulation, and transformation are essential techniques for preparing and refining data for analysis. By cleaning, transforming, and organizing raw data, we can ensure its accuracy and consistency, making it suitable for analysis and visualization. These processes form the foundation for effective data analytics, enabling us to derive meaningful insights and make informed decisions based on the data.

In this tutorial, we explored various data wrangling techniques using the cardio_train.csv dataset, including checking for missing values, transforming variables, filtering and sorting data, grouping data, and calculating summary statistics. By applying these techniques, we gained valuable insights into the dataset and learned how to manipulate and analyze data effectively using R.

Data wrangling is a critical skill for data scientists and analysts, as it allows them to work with raw data and prepare it for analysis. By mastering these techniques, you can ensure that your data is clean, well-structured, and ready for further analysis, enabling you to derive meaningful insights and make informed decisions based on the data.

References

Cardiovascular Disease Dataset on Kaggle
IHME website
R for Data Science by Wickham, H., & Grolemund, G. (2017). O’Reilly Media.
Tidyverse Learn Documentation
Tidy Data by Wickham, H. (2014). Journal of Statistical Software, 59(10), 1-23