Data visualization and manipulation with tidyverse package.
Author
Erik De Luca
Published
March 30, 2025
Load Libraries
Code
pacman::p_load( tidyverse, # A set of many useful libraries readxl, # To import the dataset from Excel here, # To avoid problems with file directories janitor, # To clean data in a fast way broom, broom.helpers, gt, # Output tables gtsummary, # Output tables for models and survival data patchwork, # merge more plots statmod # tweedie models)# load plot's themesource(here("R/plot_theme_html.R"))
Warning: package 'showtext' was built under R version 4.5.1
Loading required package: sysfonts
Warning: package 'sysfonts' was built under R version 4.5.1
Loading required package: showtextdb
Warning: package 'showtextdb' was built under R version 4.5.1
Import Data
Code
data <-read_csv(here("data", "FINAL", "dataframe_cleaned.csv") )
Rows: 268530 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): donor_class, donation_type, gender
dbl (8): birth_year, birth_cohort, first_donation_year, first_donation_cohor...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
data |>slice_sample(n =1, by =c(donation_type, gender)) |>rename_all(\(x) str_replace_all(x, "_", " ")) |>gt() |>tab_header("Sample of the dataset cleaned","The sample was stratified by the variables donation_type and gender")
Sample of the dataset cleaned
The sample was stratified by the variables donation_type and gender
donor class
donation type
birth year
birth cohort
first donation year
first donation cohort
number of donations
gender
year
age
unique number
P
SANGUE
1970
1970
1995
1995
2
F
2012
42
26525967
P
PLASMAF
1971
1970
2006
2005
2
M
2017
46
26439003
P
SANGUE
1964
1960
1993
1990
1
M
2009
45
26488779
P
AFPLTPLASM
1961
1960
2007
2005
2
F
2017
56
26872254
P
PLASMAF
1987
1985
2012
2010
2
F
2014
27
26958558
P
AFPLTPLASM
1968
1965
1987
1985
1
M
2011
43
26482980
P
PLTAFE
1959
1955
2000
2000
2
M
2012
53
26642073
P
PLTAFE
1959
1955
1979
1975
1
F
2013
54
26471973
Some Checks
Check the unique donors number
Code
data |>distinct(unique_number) |>nrow()
[1] 36625
Check if there are any duplicate rows
Code
data |> janitor::get_dupes()
No variable names specified - using all columns.
Expand the rows
Add the observation for the years which the donators haven’t donated.
Code
data |>complete(unique_number, year =full_seq(year, 1), fill =list(number_of_donations =0)) |>mutate(across(c(gender, birth_year, birth_cohort, first_donation_year, first_donation_cohort, donation_type, donor_class), \(x) coalesce(x, first(na.omit(x)))),age = year - birth_year,.by = unique_number ) -> data_with_zerosdata_with_zeros |>slice_min(n =1, order_by = unique_number) |>select(unique_number, year, age, number_of_donations)
data |>summarise(number_of_donations =sum(number_of_donations),.by =-c(number_of_donations, unique_number) ) -> summarised_data
Some analysis
Age of donors for the last year
Code
data |>filter( year ==max(year) | year ==min(year) ) |>ggplot(aes(x = gender, y = age, fill = gender)) +geom_violin() +facet_wrap(~year) +labs(title ="Donor age distribution in 2009 and 2023 by gender" )
Code
data |>select(class_year, gender) |>tbl_summary(by = gender ) |>add_overall()
Maybe the donations are reffered to the year. But is it possible to donate 17 times per year?
Code
data |>tabyl(number_of_donations, gender) |>adorn_percentages(denominator ="col") |>adorn_pct_formatting()
Code
data |>tabyl(number_of_donations, donor_class) |>adorn_percentages(denominator ="col") |>adorn_pct_formatting()
Time series
Code
data |>filter( donation_type =="SANGUE", birth_year >1930 ) |>summarise(donations =sum(number_of_donations),.by =c(class_year, year) ) |>ggplot(aes(year, donations, color = class_year, group = class_year)) +geom_line() +scale_color_brewer() +labs(title ="Number of donations by year and class of birth years" )
Code
data |>filter( donation_type =="SANGUE", birth_year >1930 ) |>summarise(donations =sum(number_of_donations),.by =c(class_age, year) ) |>ggplot(aes(year, donations, color = class_age, group = class_age)) +geom_line() +scale_color_brewer(direction =-1) +labs(title ="Number of donations by year and class of birth years" )
Code
data |>summarise(donations =sum(number_of_donations),.by =c(donation_type, year) ) |>ggplot(aes(year, donations, color = donation_type, group = donation_type)) +geom_line() +geom_smooth(linetype ="dashed",alpha = .15,size = .7 ) +scale_color_brewer(type ="qual", palette =2) +labs(title ="Number of donations by year and donation type" )
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Code
data |>summarise(donations =sum(number_of_donations),.by =c(donor_class, year) ) |>ggplot(aes(year, donations, color = donor_class, group = donor_class)) +geom_line() +geom_smooth(linetype ="dashed",alpha = .15,size = .7 ) +scale_color_brewer(type ="qual", palette =2) +labs(title ="Number of donations by year and donor class" )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Integrate Data
Integrate data with Trieste residents to get more insights
Rows: 3672 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): ITTER107, Territorio, TIPO_DATO15, Tipo di indicatore demografico, ...
dbl (5): SEXISTAT1, STATCIV2, TIME, Seleziona periodo, Value
lgl (2): Flag Codes, Flags
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
For the years prior to 2019, one could use the 2019 data from ISTAT, adjust the ages, and account for population changes with mortality tables, assuming a constant flow of immigrants and emigrants.
Let \(l_x\) be the population from the mortality table at age \(x\), and let \(n_x^y\) be the resident population in Trieste in year \(y\) at age \(x\). To estimate the population in year \(y_j\) at age \(x\), given the population in year \(y_i\), the following formula is used:
data |>left_join(residenti, by =c("gender", "year", "age")) |>filter(year %%4==3) |>summarise(ratio_donors =n() /first(population),.by =c(gender, year, age) ) |>filter(age <75) |>ggplot(aes(x = age, y = ratio_donors, fill = gender, alpha = year)) +geom_col() +facet_grid(rows =vars(year), cols =vars(gender)) +scale_y_continuous(labels = scales::percent_format()) +scale_alpha_continuous(range =c(.4, 1)) +labs(title ="Donors population ratio by gender and year" ) +theme(legend.position ="none")
Code
data |>left_join(residenti, by =c("gender", "year", "age")) |>reframe(n =n(),.by =c(gender, year, age, class_age, population) ) |>summarise(ratio_donors =sum(n) /sum(population),.by =c(year, gender, class_age) ) |>ggplot(aes(x = year, y = ratio_donors, color = class_age)) +geom_line() +facet_grid(cols =vars(gender)) +scale_y_continuous(labels = scales::percent_format()) +labs(title ="Donors population ratio by gender and year" )
Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_line()`).
Statistical models
Analyze the dependent variable
Code
map(c("donor_class", "donation_type", "gender"), \(x) { data |>summarise(n =n(),.by =c(!!sym(x), number_of_donations) ) |>ggplot(aes(x = number_of_donations, y = n, fill =!!sym(x))) +geom_col(position ="dodge") +scale_x_continuous(limits =c(0, 6)) +scale_y_continuous(labels = scales::number_format()) +labs(y ="",x ="") }) |>reduce(`/`)
Warning: Removed 17 rows containing missing values or values outside the scale range
(`geom_col()`).
Warning: Removed 16 rows containing missing values or values outside the scale range
(`geom_col()`).
Warning: Removed 22 rows containing missing values or values outside the scale range
(`geom_col()`).
Code
map(c("donor_class", "donation_type", "gender"), \(x) { data |>filter(number_of_donations <6) |>summarise(n =n(),.by =c(!!sym(x), number_of_donations) ) |># mutate(n = n / n(), .by = !!sym(x)) |> ggplot(aes(x = number_of_donations, y = n, fill =!!sym(x))) +geom_col() +facet_wrap(vars(!!sym(x))) +scale_y_continuous(labels = scales::number_format()) +labs(y ="",x ="",subtitle = x ) +theme(legend.position ="none") }) |>reduce(`/`)
How many times has a donor donated in last years?
Add a column with the cumsum of the number of donations and expand the data frame adding the no donations for whom donated and for next years no more.
Code
data |>arrange(unique_number, age) |>mutate(total_donations =cumsum(number_of_donations),.by = unique_number,.after = number_of_donations ) -> data
Code
data |>slice_max(total_donations, by = unique_number) |>filter(age <70, age >15) |>ggplot(aes(x = age, y = total_donations)) +# geom_point(position = "jitter") +geom_hex(aes(fill = class_year), alpha = .4) +geom_smooth(aes(color = class_year), linetype ="dashed", method ="lm") +guides(color ="none")
`geom_smooth()` using formula = 'y ~ x'
Code
data |>ggplot(aes(x = total_donations, y =after_stat(density))) +geom_histogram(bins =50) +scale_x_sqrt(breaks =c(0, 1, 2, 5, 10, 25, 50, 100, 200)) +labs(title ="Distribuition of total donations",subtitle ="The x axis is scaled with a square root transformation" )
Maybe I should slice by max donations by person. Other option is just to slice max by year that is the last year of observation that we have. But it’s not the same year for each one.
Code
data |>slice_max(total_donations, by = unique_number) -> data_modeldata_model |>ggplot(aes(x = total_donations, y = ..density..)) +geom_histogram(bins =50) +scale_x_sqrt(breaks =c(0, 1, 2, 5, 10, 25, 50, 100, 200)) +labs(title ="Distribuition of total donations in the last year of donations of each donator",subtitle ="The x axis is scaled with a square root transformation" )
Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.
Poisson
Code
mod <-glm(total_donations ~ age + class_age + first_donation_year + gender,data = data_model |>filter(total_donations <100),family =poisson(link ="log"),# family = Gamma(link = "inverse"), )summary(mod)
Call:
glm(formula = total_donations ~ age + class_age + first_donation_year +
gender, family = poisson(link = "log"), data = filter(data_model,
total_donations < 100))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 12.5695674 0.3924627 32.03 <2e-16 ***
age 0.0301606 0.0007126 42.32 <2e-16 ***
class_age(20,30] 0.5547732 0.0136860 40.54 <2e-16 ***
class_age(30,40] 0.7017164 0.0173531 40.44 <2e-16 ***
class_age(40,50] 0.8077123 0.0226534 35.66 <2e-16 ***
class_age(50,60] 0.8238791 0.0284090 29.00 <2e-16 ***
class_age(60,70] 0.5814609 0.0341206 17.04 <2e-16 ***
class_age(70,122] -2.4233497 0.2287379 -10.59 <2e-16 ***
first_donation_year -0.0063649 0.0001933 -32.92 <2e-16 ***
genderM 0.4139014 0.0043410 95.35 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 355031 on 36520 degrees of freedom
Residual deviance: 250199 on 36511 degrees of freedom
AIC: 368065
Number of Fisher Scoring iterations: 5
Code
plot(mod)
Poisson
Code
mod <-glm(total_donations ~ age + class_age + first_donation_year + gender,data = data_model |>filter(total_donations <100),family =quasipoisson(link ="log"),# family = Gamma(link = "inverse"), )summary(mod)
Call:
glm(formula = total_donations ~ age + class_age + first_donation_year +
gender, family = quasipoisson(link = "log"), data = filter(data_model,
total_donations < 100))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.5695674 1.1696809 10.746 < 2e-16 ***
age 0.0301606 0.0021239 14.201 < 2e-16 ***
class_age(20,30] 0.5547732 0.0407891 13.601 < 2e-16 ***
class_age(30,40] 0.7017164 0.0517184 13.568 < 2e-16 ***
class_age(40,50] 0.8077123 0.0675153 11.963 < 2e-16 ***
class_age(50,60] 0.8238791 0.0846692 9.731 < 2e-16 ***
class_age(60,70] 0.5814609 0.1016916 5.718 1.09e-08 ***
class_age(70,122] -2.4233497 0.6817217 -3.555 0.000379 ***
first_donation_year -0.0063649 0.0005762 -11.047 < 2e-16 ***
genderM 0.4139014 0.0129379 31.991 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 8.882559)
Null deviance: 355031 on 36520 degrees of freedom
Residual deviance: 250199 on 36511 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
library(statmod)library(tweedie)library(doParallel)# Set up parallel backendnumCores <-detectCores() -1# Use one less than the total corescl <-makeCluster(numCores)registerDoParallel(cl)# Function to estimate power parameter in paralleltweedie_tuning <-foreach(p =seq(.5, 3, length.out = numCores), .combine = c, .packages ='tweedie') %dopar% {tweedie.profile( total_donations ~ class_age + gender,data = data_filtered |> dplyr::slice_sample(n =1e3, by =c(class_age, gender)),p.vec = p,do.plot =FALSE )}# Stop the cluster after usagestopCluster(cl)registerDoSEQ()# Assuming `tweedie_tuning` is your original listtweedie_tuning_unlisted <- tweedie_tuning |>unlist(use.names =TRUE)# Convert to a data frametweedie_tuning_df <-tibble(name =names(tweedie_tuning_unlisted),value = tweedie_tuning_unlisted)# Group by name and create a wide format data frametweedie_tuning_wide <- tweedie_tuning_df |>group_by(name) |>mutate(row =row_number()) |>pivot_wider(names_from = name, values_from = value) |>select(-row) tweedie_tuning_wide |>mutate(across(c(p, phi), as.numeric)) |>slice_max(phi) |>pull(p) -> best_pbest_p
Profiled confidence intervals may take longer time to compute.
Use `ci_method="wald"` for faster computation of CIs.
Number of labels is greater than default palette color count.
• Select another color `palette` (and/or `package`).
Model on the last year
The Idea is to make a predictable model to provide a prediction for the next year based on the past blood donation of a donor. In this case, we’re not taking into account the donors who in the future won’t donate and the donors who in the past didn’t donate.
The dataframe will be structured in one row for each donor with the information about the past donations.
Pivot the data
We have to decide how to manage the different donation type. We could analyze jointly or separately. Now, we restrict the analysis to blood donation (SANGUE) and donor class equal to P.
Code
data |>filter(donation_type =='SANGUE', donor_class =='P') |>pivot_wider(names_from = year,names_prefix ="y_",values_from = number_of_donations,id_cols = unique_number,values_fill =0 ) |># take who has donated in the last two yearfilter(if_any(c(y_2022, y_2021), \(x) x >0)) -> donationsdonations