blood with some electronic circuit, like a mother board red and white blood with some electronic circuit, like a mother board red and white Blood Donation Predictor
  • Home
  • Thesis
  • Notebooks
  • Slides
    • Master thesis presentation
    • Probabilistic Machine Learning exam presentation

On this page

  • Import cleaned data
  • Data preparetion
  • Models
    • Poisson
    • Quasi-Poisson
    • Tweedie
    • Gamma
    • Inverse Gaussian (GLM, link = log)
    • Poisson troncata a zero (glmmTMB)
    • Binomiale Negativa (MASS::glm.nb, link = log)
    • Confronto capacità predittiva (test set)
  • Lasso model

Models in R

GLMs

R
GLM
Tidymodels
Modeling with tidymodels package.
Author

Erik De Luca

Published

September 3, 2025

Import cleaned data

Code
here::i_am("notebooks/models_in_R.qmd")
here() starts at D:/GitHub/Blood-Donors
Code
options(contrasts = c("contr.treatment", "contr.treatment"))
source(here::here("R/data_elaboration.R"))
Rows: 268530 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): donor_class, donation_type, gender
dbl (8): birth_year, birth_cohort, first_donation_year, first_donation_cohor...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
source(here::here("R/data_integration_istat.R"))
Rows: 3672 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): ITTER107, Territorio, TIPO_DATO15, Tipo di indicatore demografico, ...
dbl (5): SEXISTAT1, STATCIV2, TIME, Seleziona periodo, Value
lgl (2): Flag Codes, Flags

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
source(here::here("R/plot_theme_html.R"))
pacman::p_load(
    tidymodels,
    poissonreg,
    dotwhisker,
    doParallel,
    parallel,
    glmmTMB,
    MASS,
    broom,
    conflicted,
    broom.mixed
    )
conflict_prefer_all("dplyr")
[conflicted] Will prefer dplyr::`%>%` over any other package.
[conflicted] Will prefer dplyr::across over any other package.
[conflicted] Will prefer dplyr::add_count over any other package.
[conflicted] Will prefer dplyr::add_count_ over any other package.
[conflicted] Will prefer dplyr::add_row over any other package.
[conflicted] Will prefer dplyr::add_rownames over any other package.
[conflicted] Will prefer dplyr::add_tally over any other package.
[conflicted] Will prefer dplyr::add_tally_ over any other package.
[conflicted] Will prefer dplyr::all_equal over any other package.
[conflicted] Will prefer dplyr::all_of over any other package.
[conflicted] Will prefer dplyr::all_vars over any other package.
[conflicted] Will prefer dplyr::anti_join over any other package.
[conflicted] Will prefer dplyr::any_of over any other package.
[conflicted] Will prefer dplyr::any_vars over any other package.
[conflicted] Will prefer dplyr::arrange over any other package.
[conflicted] Will prefer dplyr::arrange_ over any other package.
[conflicted] Will prefer dplyr::arrange_all over any other package.
[conflicted] Will prefer dplyr::arrange_at over any other package.
[conflicted] Will prefer dplyr::arrange_if over any other package.
[conflicted] Will prefer dplyr::as.tbl over any other package.
[conflicted] Will prefer dplyr::as_data_frame over any other package.
[conflicted] Will prefer dplyr::as_label over any other package.
[conflicted] Will prefer dplyr::as_tibble over any other package.
[conflicted] Will prefer dplyr::auto_copy over any other package.
[conflicted] Will prefer dplyr::band_instruments over any other package.
[conflicted] Will prefer dplyr::band_instruments2 over any other package.
[conflicted] Will prefer dplyr::band_members over any other package.
[conflicted] Will prefer dplyr::bench_tbls over any other package.
[conflicted] Will prefer dplyr::between over any other package.
[conflicted] Will prefer dplyr::bind_cols over any other package.
[conflicted] Will prefer dplyr::bind_rows over any other package.
[conflicted] Will prefer dplyr::c_across over any other package.
[conflicted] Will prefer dplyr::case_match over any other package.
[conflicted] Will prefer dplyr::case_when over any other package.
[conflicted] Will prefer dplyr::changes over any other package.
[conflicted] Will prefer dplyr::check_dbplyr over any other package.
[conflicted] Will prefer dplyr::coalesce over any other package.
[conflicted] Will prefer dplyr::collapse over any other package.
[conflicted] Will prefer dplyr::collect over any other package.
[conflicted] Will prefer dplyr::combine over any other package.
[conflicted] Will prefer dplyr::common_by over any other package.
[conflicted] Will prefer dplyr::compare_tbls over any other package.
[conflicted] Will prefer dplyr::compare_tbls2 over any other package.
[conflicted] Will prefer dplyr::compute over any other package.
[conflicted] Will prefer dplyr::consecutive_id over any other package.
[conflicted] Will prefer dplyr::contains over any other package.
[conflicted] Will prefer dplyr::copy_to over any other package.
[conflicted] Will prefer dplyr::count over any other package.
[conflicted] Will prefer dplyr::count_ over any other package.
[conflicted] Will prefer dplyr::cross_join over any other package.
[conflicted] Will prefer dplyr::cumall over any other package.
[conflicted] Will prefer dplyr::cumany over any other package.
[conflicted] Will prefer dplyr::cume_dist over any other package.
[conflicted] Will prefer dplyr::cummean over any other package.
[conflicted] Will prefer dplyr::cur_column over any other package.
[conflicted] Will prefer dplyr::cur_data over any other package.
[conflicted] Will prefer dplyr::cur_data_all over any other package.
[conflicted] Will prefer dplyr::cur_group over any other package.
[conflicted] Will prefer dplyr::cur_group_id over any other package.
[conflicted] Will prefer dplyr::cur_group_rows over any other package.
[conflicted] Will prefer dplyr::current_vars over any other package.
[conflicted] Will prefer dplyr::data_frame over any other package.
[conflicted] Will prefer dplyr::db_analyze over any other package.
[conflicted] Will prefer dplyr::db_begin over any other package.
[conflicted] Will prefer dplyr::db_commit over any other package.
[conflicted] Will prefer dplyr::db_create_index over any other package.
[conflicted] Will prefer dplyr::db_create_indexes over any other package.
[conflicted] Will prefer dplyr::db_create_table over any other package.
[conflicted] Will prefer dplyr::db_data_type over any other package.
[conflicted] Will prefer dplyr::db_desc over any other package.
[conflicted] Will prefer dplyr::db_drop_table over any other package.
[conflicted] Will prefer dplyr::db_explain over any other package.
[conflicted] Will prefer dplyr::db_has_table over any other package.
[conflicted] Will prefer dplyr::db_insert_into over any other package.
[conflicted] Will prefer dplyr::db_list_tables over any other package.
[conflicted] Will prefer dplyr::db_query_fields over any other package.
[conflicted] Will prefer dplyr::db_query_rows over any other package.
[conflicted] Will prefer dplyr::db_rollback over any other package.
[conflicted] Will prefer dplyr::db_save_query over any other package.
[conflicted] Will prefer dplyr::db_write_table over any other package.
[conflicted] Will prefer dplyr::dense_rank over any other package.
[conflicted] Will prefer dplyr::desc over any other package.
[conflicted] Will prefer dplyr::dim_desc over any other package.
[conflicted] Will prefer dplyr::distinct over any other package.
[conflicted] Will prefer dplyr::distinct_ over any other package.
[conflicted] Will prefer dplyr::distinct_all over any other package.
[conflicted] Will prefer dplyr::distinct_at over any other package.
[conflicted] Will prefer dplyr::distinct_if over any other package.
[conflicted] Will prefer dplyr::distinct_prepare over any other package.
[conflicted] Will prefer dplyr::do over any other package.
[conflicted] Will prefer dplyr::do_ over any other package.
[conflicted] Will prefer dplyr::dplyr_col_modify over any other package.
[conflicted] Will prefer dplyr::dplyr_reconstruct over any other package.
[conflicted] Will prefer dplyr::dplyr_row_slice over any other package.
[conflicted] Will prefer dplyr::ends_with over any other package.
[conflicted] Will prefer dplyr::enexpr over any other package.
[conflicted] Will prefer dplyr::enexprs over any other package.
[conflicted] Will prefer dplyr::enquo over any other package.
[conflicted] Will prefer dplyr::enquos over any other package.
[conflicted] Will prefer dplyr::ensym over any other package.
[conflicted] Will prefer dplyr::ensyms over any other package.
[conflicted] Will prefer dplyr::eval_tbls over any other package.
[conflicted] Will prefer dplyr::eval_tbls2 over any other package.
[conflicted] Will prefer dplyr::everything over any other package.
[conflicted] Will prefer dplyr::explain over any other package.
[conflicted] Will prefer dplyr::expr over any other package.
[conflicted] Will prefer dplyr::failwith over any other package.
[conflicted] Will prefer dplyr::filter over any other package.
[conflicted] Will prefer dplyr::filter_ over any other package.
[conflicted] Will prefer dplyr::filter_all over any other package.
[conflicted] Will prefer dplyr::filter_at over any other package.
[conflicted] Will prefer dplyr::filter_if over any other package.
[conflicted] Will prefer dplyr::first over any other package.
[conflicted] Will prefer dplyr::full_join over any other package.
[conflicted] Will prefer dplyr::funs over any other package.
[conflicted] Will prefer dplyr::funs_ over any other package.
[conflicted] Will prefer dplyr::glimpse over any other package.
[conflicted] Will prefer dplyr::group_by over any other package.
[conflicted] Will prefer dplyr::group_by_ over any other package.
[conflicted] Will prefer dplyr::group_by_all over any other package.
[conflicted] Will prefer dplyr::group_by_at over any other package.
[conflicted] Will prefer dplyr::group_by_drop_default over any other package.
[conflicted] Will prefer dplyr::group_by_if over any other package.
[conflicted] Will prefer dplyr::group_by_prepare over any other package.
[conflicted] Will prefer dplyr::group_cols over any other package.
[conflicted] Will prefer dplyr::group_data over any other package.
[conflicted] Will prefer dplyr::group_indices over any other package.
[conflicted] Will prefer dplyr::group_indices_ over any other package.
[conflicted] Will prefer dplyr::group_keys over any other package.
[conflicted] Will prefer dplyr::group_map over any other package.
[conflicted] Will prefer dplyr::group_modify over any other package.
[conflicted] Will prefer dplyr::group_nest over any other package.
[conflicted] Will prefer dplyr::group_rows over any other package.
[conflicted] Will prefer dplyr::group_size over any other package.
[conflicted] Will prefer dplyr::group_split over any other package.
[conflicted] Will prefer dplyr::group_trim over any other package.
[conflicted] Will prefer dplyr::group_vars over any other package.
[conflicted] Will prefer dplyr::group_walk over any other package.
[conflicted] Will prefer dplyr::grouped_df over any other package.
[conflicted] Will prefer dplyr::groups over any other package.
[conflicted] Will prefer dplyr::id over any other package.
[conflicted] Will prefer dplyr::ident over any other package.
[conflicted] Will prefer dplyr::if_all over any other package.
[conflicted] Will prefer dplyr::if_any over any other package.
[conflicted] Will prefer dplyr::if_else over any other package.
[conflicted] Will prefer dplyr::inner_join over any other package.
[conflicted] Will prefer dplyr::intersect over any other package.
[conflicted] Will prefer dplyr::is.grouped_df over any other package.
[conflicted] Will prefer dplyr::is.src over any other package.
[conflicted] Will prefer dplyr::is.tbl over any other package.
[conflicted] Will prefer dplyr::is_grouped_df over any other package.
[conflicted] Will prefer dplyr::join_by over any other package.
[conflicted] Will prefer dplyr::lag over any other package.
[conflicted] Will prefer dplyr::last over any other package.
[conflicted] Will prefer dplyr::last_col over any other package.
[conflicted] Will prefer dplyr::last_dplyr_warnings over any other package.
[conflicted] Will prefer dplyr::lead over any other package.
[conflicted] Will prefer dplyr::left_join over any other package.
[conflicted] Will prefer dplyr::location over any other package.
[conflicted] Will prefer dplyr::lst over any other package.
[conflicted] Will prefer dplyr::make_tbl over any other package.
[conflicted] Will prefer dplyr::matches over any other package.
[conflicted] Will prefer dplyr::min_rank over any other package.
[conflicted] Will prefer dplyr::mutate over any other package.
[conflicted] Will prefer dplyr::mutate_ over any other package.
[conflicted] Will prefer dplyr::mutate_all over any other package.
[conflicted] Will prefer dplyr::mutate_at over any other package.
[conflicted] Will prefer dplyr::mutate_each over any other package.
[conflicted] Will prefer dplyr::mutate_each_ over any other package.
[conflicted] Will prefer dplyr::mutate_if over any other package.
[conflicted] Will prefer dplyr::n over any other package.
[conflicted] Will prefer dplyr::n_distinct over any other package.
[conflicted] Will prefer dplyr::n_groups over any other package.
[conflicted] Will prefer dplyr::na_if over any other package.
[conflicted] Will prefer dplyr::near over any other package.
[conflicted] Will prefer dplyr::nest_by over any other package.
[conflicted] Will prefer dplyr::nest_join over any other package.
[conflicted] Will prefer dplyr::new_grouped_df over any other package.
[conflicted] Will prefer dplyr::new_rowwise_df over any other package.
[conflicted] Will prefer dplyr::nth over any other package.
[conflicted] Will prefer dplyr::ntile over any other package.
[conflicted] Will prefer dplyr::num_range over any other package.
[conflicted] Will prefer dplyr::one_of over any other package.
[conflicted] Will prefer dplyr::order_by over any other package.
[conflicted] Will prefer dplyr::percent_rank over any other package.
[conflicted] Will prefer dplyr::pick over any other package.
[conflicted] Will prefer dplyr::progress_estimated over any other package.
[conflicted] Will prefer dplyr::pull over any other package.
[conflicted] Will prefer dplyr::quo over any other package.
[conflicted] Will prefer dplyr::quo_name over any other package.
[conflicted] Will prefer dplyr::quos over any other package.
[conflicted] Will prefer dplyr::recode over any other package.
[conflicted] Will prefer dplyr::recode_factor over any other package.
[conflicted] Will prefer dplyr::reframe over any other package.
[conflicted] Will prefer dplyr::relocate over any other package.
[conflicted] Will prefer dplyr::rename over any other package.
[conflicted] Will prefer dplyr::rename_ over any other package.
[conflicted] Will prefer dplyr::rename_all over any other package.
[conflicted] Will prefer dplyr::rename_at over any other package.
[conflicted] Will prefer dplyr::rename_if over any other package.
[conflicted] Will prefer dplyr::rename_vars over any other package.
[conflicted] Will prefer dplyr::rename_vars_ over any other package.
[conflicted] Will prefer dplyr::rename_with over any other package.
[conflicted] Will prefer dplyr::right_join over any other package.
[conflicted] Will prefer dplyr::row_number over any other package.
[conflicted] Will prefer dplyr::rows_append over any other package.
[conflicted] Will prefer dplyr::rows_delete over any other package.
[conflicted] Will prefer dplyr::rows_insert over any other package.
[conflicted] Will prefer dplyr::rows_patch over any other package.
[conflicted] Will prefer dplyr::rows_update over any other package.
[conflicted] Will prefer dplyr::rows_upsert over any other package.
[conflicted] Will prefer dplyr::rowwise over any other package.
[conflicted] Will prefer dplyr::same_src over any other package.
[conflicted] Will prefer dplyr::sample_frac over any other package.
[conflicted] Will prefer dplyr::sample_n over any other package.
[conflicted] Will prefer dplyr::select over any other package.
[conflicted] Will prefer dplyr::select_ over any other package.
[conflicted] Will prefer dplyr::select_all over any other package.
[conflicted] Will prefer dplyr::select_at over any other package.
[conflicted] Will prefer dplyr::select_if over any other package.
[conflicted] Will prefer dplyr::select_var over any other package.
[conflicted] Will prefer dplyr::select_vars over any other package.
[conflicted] Will prefer dplyr::select_vars_ over any other package.
[conflicted] Will prefer dplyr::semi_join over any other package.
[conflicted] Will prefer dplyr::setdiff over any other package.
[conflicted] Will prefer dplyr::setequal over any other package.
[conflicted] Will prefer dplyr::show_query over any other package.
[conflicted] Will prefer dplyr::slice over any other package.
[conflicted] Will prefer dplyr::slice_ over any other package.
[conflicted] Will prefer dplyr::slice_head over any other package.
[conflicted] Will prefer dplyr::slice_max over any other package.
[conflicted] Will prefer dplyr::slice_min over any other package.
[conflicted] Will prefer dplyr::slice_sample over any other package.
[conflicted] Will prefer dplyr::slice_tail over any other package.
[conflicted] Will prefer dplyr::sql over any other package.
[conflicted] Will prefer dplyr::sql_escape_ident over any other package.
[conflicted] Will prefer dplyr::sql_escape_string over any other package.
[conflicted] Will prefer dplyr::sql_join over any other package.
[conflicted] Will prefer dplyr::sql_select over any other package.
[conflicted] Will prefer dplyr::sql_semi_join over any other package.
[conflicted] Will prefer dplyr::sql_set_op over any other package.
[conflicted] Will prefer dplyr::sql_subquery over any other package.
[conflicted] Will prefer dplyr::sql_translate_env over any other package.
[conflicted] Will prefer dplyr::src over any other package.
[conflicted] Will prefer dplyr::src_df over any other package.
[conflicted] Will prefer dplyr::src_local over any other package.
[conflicted] Will prefer dplyr::src_mysql over any other package.
[conflicted] Will prefer dplyr::src_postgres over any other package.
[conflicted] Will prefer dplyr::src_sqlite over any other package.
[conflicted] Will prefer dplyr::src_tbls over any other package.
[conflicted] Will prefer dplyr::starts_with over any other package.
[conflicted] Will prefer dplyr::starwars over any other package.
[conflicted] Will prefer dplyr::storms over any other package.
[conflicted] Will prefer dplyr::summarise over any other package.
[conflicted] Will prefer dplyr::summarise_ over any other package.
[conflicted] Will prefer dplyr::summarise_all over any other package.
[conflicted] Will prefer dplyr::summarise_at over any other package.
[conflicted] Will prefer dplyr::summarise_each over any other package.
[conflicted] Will prefer dplyr::summarise_each_ over any other package.
[conflicted] Will prefer dplyr::summarise_if over any other package.
[conflicted] Will prefer dplyr::summarize over any other package.
[conflicted] Will prefer dplyr::summarize_ over any other package.
[conflicted] Will prefer dplyr::summarize_all over any other package.
[conflicted] Will prefer dplyr::summarize_at over any other package.
[conflicted] Will prefer dplyr::summarize_each over any other package.
[conflicted] Will prefer dplyr::summarize_each_ over any other package.
[conflicted] Will prefer dplyr::summarize_if over any other package.
[conflicted] Will prefer dplyr::sym over any other package.
[conflicted] Will prefer dplyr::symdiff over any other package.
[conflicted] Will prefer dplyr::syms over any other package.
[conflicted] Will prefer dplyr::tally over any other package.
[conflicted] Will prefer dplyr::tally_ over any other package.
[conflicted] Will prefer dplyr::tbl over any other package.
[conflicted] Will prefer dplyr::tbl_df over any other package.
[conflicted] Will prefer dplyr::tbl_nongroup_vars over any other package.
[conflicted] Will prefer dplyr::tbl_ptype over any other package.
[conflicted] Will prefer dplyr::tbl_vars over any other package.
[conflicted] Will prefer dplyr::tibble over any other package.
[conflicted] Will prefer dplyr::top_frac over any other package.
[conflicted] Will prefer dplyr::top_n over any other package.
[conflicted] Will prefer dplyr::transmute over any other package.
[conflicted] Will prefer dplyr::transmute_ over any other package.
[conflicted] Will prefer dplyr::transmute_all over any other package.
[conflicted] Will prefer dplyr::transmute_at over any other package.
[conflicted] Will prefer dplyr::transmute_if over any other package.
[conflicted] Will prefer dplyr::tribble over any other package.
[conflicted] Will prefer dplyr::type_sum over any other package.
[conflicted] Will prefer dplyr::ungroup over any other package.
[conflicted] Will prefer dplyr::union over any other package.
[conflicted] Will prefer dplyr::union_all over any other package.
[conflicted] Will prefer dplyr::validate_grouped_df over any other package.
[conflicted] Will prefer dplyr::validate_rowwise_df over any other package.
[conflicted] Will prefer dplyr::vars over any other package.
[conflicted] Will prefer dplyr::where over any other package.
[conflicted] Will prefer dplyr::with_groups over any other package.
[conflicted] Will prefer dplyr::with_order over any other package.
[conflicted] Will prefer dplyr::wrap_dbplyr_obj over any other package.

Data preparetion

Code
set.seed(1)

data_total|> 
    mutate(strata_var = interaction(gender, class_age)) -> data_model

data_split = initial_split(data_model, prop = .8, strata = strata_var)

data_train = training(data_split)
data_test = testing(data_split)

Models

Set the formula to use

Code
formula_glm <- total_donations ~ gender + class_age 
Code
augment_glm <- function(fit_wf, data) {
  m <- extract_fit_engine(fit_wf)         # oggetto 'glm'
  tib <- broom::augment(
    m,
    data = data,
    type.predict = "response",             # fitted μ
    type.residuals = "deviance"            # aggiunge .resid
  ) |>
    mutate(
      .mu      = fitted(m),
      .r_pear  = residuals(m, type = "pearson"),
      .r_dev   = residuals(m, type = "deviance"),
      .hat     = hatvalues(m),
      .cooksd  = cooks.distance(m),
      .std_dev = .r_dev / sqrt(pmax(1e-12, 1 - .hat))
    )
  tib
}

dispersion_phi <- function(r_pear, df_resid) {
  sum(r_pear^2, na.rm = TRUE) / df_resid
}

calib_by_decile <- function(df, y = total_donations, mu = .mu, n_bins = 10) {
  df |>
    mutate(.bin = ntile({{ mu }}, n_bins)) |>
    summarise(
      obs_mean = mean({{ y }}, na.rm = TRUE),
      pred_mean = mean({{ mu }}, na.rm = TRUE),
      .by = .bin
    )
}

Poisson

Code
spec_pois <- poisson_reg(mode = "regression") |>
  set_engine("glm", family = poisson(link = "log"))

wf_pois <- workflow() |>
  add_model(spec_pois) |>
  add_formula(formula_glm)

fit_pois_wf <- wf_pois|> fit(data = data_train)
Code
tidy(fit_pois_wf)
Code
tidy(fit_pois_wf) |> 
  dwplot(dot_args = list(size = 2, color = "black"),
        show_intercept = T,
         whisker_args = list(color = "black"),
         vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))
Warning: Using the `size` aesthetic with geom_segment was deprecated in ggplot2 3.4.0.
ℹ Please use the `linewidth` aesthetic instead.
ℹ The deprecated feature was likely used in the dotwhisker package.
  Please report the issue at <https://github.com/fsolt/dotwhisker/issues>.
Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
not found in Windows font database
Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
aug_pois_tr <- augment_glm(fit_pois_wf, data_train)

phi_pois <- dispersion_phi(
  r_pear   = aug_pois_tr$.r_pear,
  df_resid = extract_fit_engine(fit_pois_wf)$df.residual
)
phi_pois
[1] 5.732167
Code
# Residui devianza vs μ̂ (pattern → non linearità/eteroschedasticità)
saveRDS(aug_pois_tr, here("R/aug_pois_tr.rds"))
aug_pois_tr |>
  ggplot(aes(.mu, .r_dev)) +
  geom_point(alpha = .25) +
  geom_smooth(se = FALSE, color = state_cols[1]) +
  labs(x = expression(hat(mu)), y = "Residui di devianza",
       title = "Poisson — Residui di devianza vs fitted")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
# QQ-plot residui di devianza standardizzati (asimmetria/code)
aug_pois_tr |>
  ggplot(aes(sample = .std_dev)) +
  stat_qq(alpha = .6) +
  stat_qq_line(color = state_cols[1]) +
  labs(title = "Quasi-Poisson — QQ plot residui di devianza standardizzati", x="", y="") -> p
ggsave(here("thesis/img/glm/quasipoisson_qqplot.png"), plot = p, width = 5, height = 5, dpi = 300)
p
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
# Punti influenti (Cook's distance)
aug_pois_tr |>
  slice_max(.cooksd, n = 20) |>
  mutate(.row = row_number()) |>
  ggplot(aes(x = reorder(as.factor(.row), .cooksd), y = .cooksd)) +
  geom_col(fill = "grey40") +
  coord_flip() +
  labs(x = "Osservazioni (top 20)", y = "Cook's distance",
       title = "Poisson — Punti influenti")
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
# Calibrazione per decili di μ̂
calib_pois <- calib_by_decile(aug_pois_tr, y = total_donations, mu = .mu, n_bins = 10)

calib_pois |>
  pivot_longer(c(obs_mean, pred_mean), names_to = "serie", values_to = "val") |>
  ggplot(aes(.bin, val, color = serie, group = serie)) +
  geom_line(size = 1) +
  geom_point() +
  labs(x = "Decile di μ̂", y = "Media", color = "",
       title = "Poisson — Calibrazione per decile di fitted")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Quasi-Poisson

Code
spec_qpois <- poisson_reg(mode = "regression") |>
  set_engine("glm", family = quasipoisson(link = "log"))

wf_qpois <- workflow() |>
  add_model(spec_qpois) |>
  add_formula(formula_glm)

fit_qpois_wf <- wf_qpois |> fit(data = data_train)
Code
# tidy(fit_qpois_wf)
extract_fit_engine(fit_qpois_wf)

Call:  stats::glm(formula = ..y ~ ., family = ~quasipoisson(link = "log"), 
    data = data)

Coefficients:
       (Intercept)             genderM  `class_age(25,35]`  `class_age(35,45]`  
            0.6043              0.5276              0.5584              0.8959  
`class_age(45,55]`  `class_age(55,65]`  `class_age(65,70]`  
            1.2512              1.4799              1.6433  

Degrees of Freedom: 27561 Total (i.e. Null);  27555 Residual
Null Deviance:      192200 
Residual Deviance: 135400   AIC: NA
Code
fit_qpois_wf |> saveRDS(here("R/fit_qpois_wf.rds"))
Code
tidy(fit_pois_wf) |> 
  dwplot(
    dot_args = list(size = 2, color = "black"),
    show_intercept = T,
    whisker_args = list(color = "black"),
    vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2)
    ) -> p
ggsave(here("thesis/img/glm/quasipoisson_coeff.png"), plot = p, width = 6, height = 4, dpi = 300)
p
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
aug_qpois_tr <- augment_glm(fit_qpois_wf, data_train)

phi_qpois <- dispersion_phi(
  r_pear   = aug_qpois_tr$.r_pear,
  df_resid = extract_fit_engine(fit_qpois_wf)$df.residual
)
phi_qpois
[1] 5.732167
Code
aug_qpois_tr |>
  ggplot(aes(.mu, .r_dev)) +
  geom_point(alpha = .25) +
  geom_smooth(se = FALSE, color = state_cols[1]) +
  labs(x = expression(hat(mu)), y = "Residui di devianza",
       title = "Quasi-Poisson — Residui di devianza vs fitted")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Tweedie

Code
library(statmod)
library(tweedie)
library(doParallel)

numCores <- detectCores() - 1
cl <- makeCluster(numCores)
registerDoParallel(cl)

tweedie_tuning <- foreach(p = seq(1, 3, length.out = numCores*2), .combine =   , .packages = 'tweedie') %dopar% {
  tweedie.profile(
    total_donations ~ class_age + gender,
    data = data_train,
    p.vec = p,
    do.plot = FALSE
  )
}

stopCluster(cl)
registerDoSEQ()

tweedie_tuning_unlisted <- tweedie_tuning |> 
  unlist(use.names = TRUE)

tweedie_tuning_df <- tibble(
  name = names(tweedie_tuning_unlisted),
  value = tweedie_tuning_unlisted
)
saveRDS(tweedie_tuning_df, here::here("R/tweedie_tuning_df.rds"))
Code
tweedie_tuning_df <- readRDS(here::here("R/tweedie_tuning_df.rds"))
tweedie_tuning_wide <- tweedie_tuning_df |> 
  group_by(name) |> 
  mutate(row = row_number()) |> 
  pivot_wider(names_from = name, values_from = value) |> 
  dplyr::select(-row)

tweedie_tuning_wide |>
  mutate(across(c(p, phi, L.max), as.numeric)) |> 
  slice_max(L.max) |> 
  pull(p) -> best_p
Code
tweedie_tuning_wide |> 
  mutate(across(c(p, phi, L.max), as.numeric)) |> 
  ggplot(aes(x = p, y = phi)) +
  geom_point() +
  geom_point(
    data = tweedie_tuning_wide |>
      mutate(across(c(p, phi), as.numeric)) |> 
      slice_max(phi),
    aes(x = p, y = phi),
    color = state_cols[3]
      ) +
  labs(title = "Devianza del modello al variare del parametro potenza") +
  geom_line(linewidth = .2)
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
spec_tweedie <- poisson_reg(mode = "regression") |>
  set_engine("glm")|> 
  set_args(family = statmod::tweedie(var.power = best_p, link.power = 0))

wf_tweedie <- workflow() |>
  add_model(spec_tweedie) |>
  add_formula(formula_glm)

fit_tweedie_wf <- wf_tweedie |> fit(data = data_train)
Code
# tidy(fit_qpois_wf)
extract_fit_engine(fit_tweedie_wf)

Call:  stats::glm(formula = ..y ~ ., family = ~statmod::tweedie(var.power = best_p, 
    link.power = 0), data = data)

Coefficients:
       (Intercept)             genderM  `class_age(25,35]`  `class_age(35,45]`  
            0.7315              0.3694              0.5500              0.8400  
`class_age(45,55]`  `class_age(55,65]`  `class_age(65,70]`  
            1.1928              1.4380              1.6808  

Degrees of Freedom: 27561 Total (i.e. Null);  27555 Residual
Null Deviance:      8408 
Residual Deviance: 6638     AIC: NA
Code
tidy(fit_tweedie_wf) |> 
  dwplot(
    dot_args = list(size = 2, color = "black"),
    show_intercept = T,
    whisker_args = list(color = "black"),
    vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2)
    )->p
ggsave(here("thesis/img/glm/tw_coeff.png"), plot = p, width = 6, height = 4, dpi = 300)
p
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
aug_tw_tr <- augment_glm(fit_tweedie_wf, data_train)

phi_tw <- dispersion_phi(
  r_pear   = aug_tw_tr$.r_pear,
  df_resid = extract_fit_engine(fit_tweedie_wf)$df.residual
)
phi_tw
[1] 0.2050554
Code
aug_tw_tr |>
  ggplot(aes(.mu, .r_dev)) +
  geom_point(alpha = .25) +
  geom_smooth(se = FALSE, color = state_cols[1]) +
  labs(x = expression(hat(mu)), y = "Residui di devianza",
       title = "Tweedie — Residui di devianza vs fitted")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
calib_tw <- calib_by_decile(aug_tw_tr, y = total_donations, mu = .mu, n_bins = 10)

calib_tw |>
  pivot_longer(c(obs_mean, pred_mean), names_to = "serie", values_to = "val") |>
  ggplot(aes(.bin, val, color = serie, group = serie)) +
  geom_line(size = 1) +
  geom_point() +
  theme(legend.position = "bottom") +
  labs(x = "Decile di μ̂", y = "Media", color = "",
       title = "Tweedie — Calibrazione per decile di fitted") -> p
ggsave(here("thesis/img/glm/tweedie_decile_mu.png"), plot = p, width = 6, height = 4, dpi = 300)
p
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Gamma

Code
spec_gamma <- linear_reg(mode = "regression") |>
  set_engine("glm")|> 
  set_args(family = Gamma(link = "inverse"))

wf_gamma <- workflow() |>
  add_model(spec_gamma) |>
  add_formula(formula_glm)

fit_gamma_wf <- wf_gamma |> fit(data = data_train)
Code
# tidy(fit_qpois_wf)
extract_fit_engine(fit_gamma_wf)

Call:  stats::glm(formula = ..y ~ ., family = ~Gamma(link = "inverse"), 
    data = data)

Coefficients:
       (Intercept)             genderM  `class_age(25,35]`  `class_age(35,45]`  
           0.44758            -0.08217            -0.17117            -0.23234  
`class_age(45,55]`  `class_age(55,65]`  `class_age(65,70]`  
          -0.27550            -0.29566            -0.30643  

Degrees of Freedom: 27561 Total (i.e. Null);  27555 Residual
Null Deviance:      32990 
Residual Deviance: 23350    AIC: 145300
Code
tidy(fit_gamma_wf) |> 
  dwplot(
    dot_args = list(size = 2, color = "black"),
    show_intercept = T,
    whisker_args = list(color = "black"),
    vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2)
    ) -> p
ggsave(here("thesis/img/glm/gamma_coeff.png"), plot = p, width = 6, height = 4, dpi = 300)
p
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
aug_gamma_tr <- augment_glm(fit_gamma_wf, data_train)
Code
aug_gamma_tr |>
  ggplot(aes(.mu, .r_dev)) +
  geom_point(alpha = .25) +
  geom_smooth(se = FALSE, color = state_cols[1]) +
  labs(x = expression(hat(mu)), y = "Residui di devianza",
title = "Tweedie — Residui di devianza vs fitted") ->p
ggsave(here("thesis/img/glm/gamma_res.png"), plot = p, width = 6, height = 4, dpi = 300)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Code
p
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
calib_gamma <- calib_by_decile(aug_gamma_tr, y = total_donations, mu = .mu, n_bins = 10)

calib_gamma |>
  pivot_longer(c(obs_mean, pred_mean), names_to = "serie", values_to = "val") |>
  ggplot(aes(.bin, val, color = serie, group = serie)) +
  geom_line(size = 1) +
  theme(legend.position = "bottom") +
  geom_point() +
  labs(x = "Decile di μ̂", y = "Media", color = "",
       title = "Tweedie — Calibrazione per decile di fitted")
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Inverse Gaussian (GLM, link = log)

Code
# Fit con glm (family = inverse.gaussian, link = log)
fit_igauss <- glm(
  formula = formula_glm,
  data    = data_train,
  family  = inverse.gaussian(link = "log")
)

summary(fit_igauss)

Call:
glm(formula = formula_glm, family = inverse.gaussian(link = "log"), 
    data = data_train)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.731522   0.009621   76.03   <2e-16 ***
genderM          0.369377   0.011599   31.85   <2e-16 ***
class_age(25,35] 0.550043   0.014587   37.71   <2e-16 ***
class_age(35,45] 0.839965   0.018025   46.60   <2e-16 ***
class_age(45,55] 1.192842   0.020125   59.27   <2e-16 ***
class_age(55,65] 1.437982   0.027544   52.21   <2e-16 ***
class_age(65,70] 1.680819   0.107611   15.62   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for inverse.gaussian family taken to be 0.2050562)

    Null deviance: 8407.5  on 27561  degrees of freedom
Residual deviance: 6637.6  on 27555  degrees of freedom
AIC: 140669

Number of Fisher Scoring iterations: 10
Code
saveRDS(fit_igauss, here("R/fit_igauss.rds"))
Code
# Coefficienti e plot stile dotwhisker
library(broom)

tidy(fit_igauss, conf.int = TRUE) |>
  dwplot(
    dot_args = list(size = 2, color = "black"),
    show_intercept = TRUE,
    whisker_args = list(color = "black"),
    vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2)
  ) -> p_igauss

ggsave(here("thesis/img/glm/inverse_gaussian_coeff.png"),
       plot = p_igauss, width = 6, height = 4, dpi = 300)
p_igauss
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
# Residui e diagnostica rapida
aug_igauss <- broom::augment(
  fit_igauss,
  type.predict   = "response",
  type.residuals = "deviance"
) |>
  dplyr::rename(.mu = .fitted, .r_dev = .resid)

phi_igauss <- summary(fit_igauss)$dispersion
phi_igauss
[1] 0.2050562
Code
aug_igauss |>
  ggplot(aes(.mu, .r_dev)) +
  geom_point(alpha = .25) +
  geom_smooth(se = FALSE, color = state_cols[1]) +
  labs(x = expression(hat(mu)), y = "Residui di devianza",
       title = "Inverse Gaussian — Residui di devianza vs fitted")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Poisson troncata a zero (glmmTMB)

Code
# Fit (attenzione: y > 0 obbligatorio)
data_train_pos <- dplyr::filter(data_train, total_donations > 0)

# Specifica "tidymodels-like": formula già definita in `formula_glm`
# Nota: la famiglia troncata è gestita da glmmTMB (non esiste family equivalente in glm)
fit_tpoisson <- glmmTMB(
  formula = formula_glm,                     # es. total_donations ~ gender + class_age
  data    = data_train_pos,
  family  = glmmTMB::truncated_poisson(link = "log")
)

# Ispezione e salvataggio
summary(fit_tpoisson)
 Family: truncated_poisson  ( log )
Formula:          total_donations ~ gender + class_age
Data: data_train_pos

      AIC       BIC    logLik -2*log(L)  df.resid 
 219900.6  219958.2 -109943.3  219886.6     27555 


Conditional model:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      0.458187   0.009444   48.51   <2e-16 ***
genderM          0.561213   0.005886   95.35   <2e-16 ***
class_age(25,35] 0.663710   0.010480   63.33   <2e-16 ***
class_age(35,45] 1.012828   0.010229   99.01   <2e-16 ***
class_age(45,55] 1.370708   0.009640  142.18   <2e-16 ***
class_age(55,65] 1.599751   0.009928  161.13   <2e-16 ***
class_age(65,70] 1.761972   0.018650   94.48   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
saveRDS(fit_tpoisson, here("R/fit_tpoisson.rds"))
Code
# Coefficienti e forest plot (dotwhisker)
tidy(fit_tpoisson, conf.int = TRUE, effects = "fixed") |>
  dwplot(
    dot_args = list(size = 2, color = "black"),
    show_intercept = TRUE,
    whisker_args = list(color = "black"),
    vline = ggplot2::geom_vline(xintercept = 0, colour = "grey50", linetype = 2)
  ) -> p_tpoisson

ggsave(here("thesis/img/glm/truncated_poisson_coeff.png"),
       plot = p_tpoisson, width = 6, height = 4, dpi = 300)
p_tpoisson
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
# Residui e diagnostica rapida
aug_tpoisson <- tibble::tibble(
  .mu    = fitted(fit_tpoisson, type = "response"),
  .r_dev = data_train$total_donations - .mu
)

aug_tpoisson |>
  ggplot(aes(.mu, .r_dev)) +
  geom_point(alpha = .25) +
  geom_smooth(se = T, color = state_cols[1]) +
  scale_x_continuous(breaks = 1:16) +
  labs(x = expression(hat(mu)), y = "Residui di devianza",
       title = "Zero-truncated Poisson — Residui di devianza vs fitted")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database


Binomiale Negativa (MASS::glm.nb, link = log)

Code
# Fit NB (overdispersione rispetto a Poisson)
fit_nb <- MASS::glm.nb(
  formula = formula_glm,    # es. total_donations ~ gender + class_age
  data    = data_train,
  link    = log
)

# Ispezione e salvataggio
summary(fit_nb)

Call:
MASS::glm.nb(formula = formula_glm, data = data_train, link = log, 
    init.theta = 1.611423434)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)       0.65674    0.01340   49.01   <2e-16 ***
genderM           0.47348    0.01156   40.96   <2e-16 ***
class_age(25,35]  0.55589    0.01647   33.76   <2e-16 ***
class_age(35,45]  0.87031    0.01726   50.41   <2e-16 ***
class_age(45,55]  1.22302    0.01661   73.61   <2e-16 ***
class_age(55,65]  1.45951    0.01880   77.65   <2e-16 ***
class_age(65,70]  1.66284    0.05330   31.20   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.6114) family taken to be 1)

    Null deviance: 38868  on 27561  degrees of freedom
Residual deviance: 27031  on 27555  degrees of freedom
AIC: 150028

Number of Fisher Scoring iterations: 1

              Theta:  1.6114 
          Std. Err.:  0.0169 

 2 x log-likelihood:  -150012.3430 
Code
fit_nb$theta  # parametro di dispersione (più alto -> meno overdispersione)
[1] 1.611423
Code
saveRDS(fit_nb, here("R/fit_nb.rds"))
Code
# Coefficienti e forest plot (dotwhisker)
tidy(fit_nb, conf.int = TRUE) |>
  dwplot(
    dot_args = list(size = 2, color = "black"),
    show_intercept = TRUE,
    whisker_args = list(color = "black"),
    vline = ggplot2::geom_vline(xintercept = 0, colour = "grey50", linetype = 2)
  ) -> p_nb

ggsave(here("thesis/img/glm/negbin_coeff.png"),
       plot = p_nb, width = 6, height = 4, dpi = 300)
p_nb
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
# Residui e diagnostica rapida
aug_nb <- broom::augment(
  fit_nb,
  type.predict   = "response",
  type.residuals = "deviance"
) |>
  dplyr::rename(.mu = .fitted, .r_dev = .resid)
Warning: The `augment()` method for objects of class `negbin` is not maintained by the broom team, and is only supported through the `glm` tidier method. Please be cautious in interpreting and reporting broom output.

This warning is displayed once per session.
Code
saveRDS(aug_nb, here("R/aug_nb.rds"))

aug_nb |>
  ggplot(aes(.mu, .r_dev)) +
  geom_point(alpha = .25) +
  geom_smooth(se = FALSE, color = state_cols[1]) +
  labs(x = expression(hat(mu)), y = "Residui di devianza",
       title = "Negative Binomial — Residui di devianza vs fitted")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Confronto capacità predittiva (test set)

Valutiamo errori su test per ciascun GLM e una devianza di Poisson “pseudo” (informativa anche off-family).

Code
# Predizioni su test
pred_tpois_te <- tibble(.pred = predict(fit_tpoisson, newdata = data_test, type = "response")) |> 
  mutate(model = "Truncated Poisson")
pred_nb_te <- tibble(.pred = predict(fit_nb, newdata = data_test, type = "response")) |> 
  mutate(model = "Negative Binomial")
pred_pois_te <- predict(fit_pois_wf,    new_data = data_test) |> mutate(model = "Poisson")
pred_qpois_te<- predict(fit_qpois_wf,   new_data = data_test) |> mutate(model = "Quasi-Poisson")
pred_tw_te   <- predict(fit_tweedie_wf, new_data = data_test) |> mutate(model = "Tweedie")
pred_gamma_te   <- predict(fit_gamma_wf, new_data = data_test) |> mutate(model = "Gamma")

pred_te <- bind_rows(pred_pois_te, pred_qpois_te, pred_tw_te, pred_gamma_te, pred_tpois_te, pred_nb_te) |>
  bind_cols(
    y =
    data_test |> 
    pull(total_donations) |> 
    rep(6)
    ) |> 
  rename(mu = .pred)

# Poisson deviance totale (convenz. 0·log(0/μ)=0)
poisson_deviance <- function(y, mu) {
  y <- pmax(y, 0); mu <- pmax(mu, 1e-10)
  2 * sum(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
}

perf_te <- pred_te |>
  group_by(model) |>
  summarise(
    rmse  = yardstick::rmse_vec(truth = y, estimate = mu),
    mae   = yardstick::mae_vec(truth = y, estimate = mu),
    rsq   = yardstick::rsq_vec(truth = y, estimate = mu),
    # dev_p = poisson_deviance(y, mu),
    .groups = "drop"
  )
saveRDS(perf_te, here("R/model_performance.rds"))
perf_te 
Code
perf_te |>
  pivot_longer(-model) |>
  ggplot(aes(model, value, fill = model)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ name, scales = "free_y") +
  labs(x = "", y = "", title = "Confronto metriche su test (GLM)")
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Lasso model

Code
formula_glmnet <- total_donations ~ gender + class_age * class_year + class_year * first_donation_year 
Code
spec_glmnet <- linear_reg(
  penalty = tune(),
  mixture = tune()
  )|> 
  set_engine("glmnet")

data_grid <- grid_regular(penalty(c(5, 0)),
                          mixture(c(0, 1)),
                          levels = 30)

set.seed(1)
data_folds <- vfold_cv(data_train, v = 10)

wf_glmnet <- workflow() |>
  add_model(spec_glmnet) |>
  add_formula(formula_glmnet)
Code
cores <- detectCores() -1
cl <- makeCluster(cores)
registerDoParallel(cl)

fit_glmnet_wf <- 
  wf_glmnet |> 
  tune_grid(
    resamples =  data_folds,
    grid = data_grid
    )

stopCluster(cl)

saveRDS(fit_glmnet_wf, here("R/fit_glmnet_wf.rds"))
zip("R/fit_glmnet_wf.zip", "R/fit_glmnet_wf.rds")
file.remove(here::here("R/fit_glmnet_wf.rds"))
Code
# unzip(here::here("R/fit_glmnet_wf.zip"))
# fit_glmnet_wf <- readRDS(here::here("R/fit_glmnet_wf.rds"))
fit_glmnet_wf <- readRDS("R/fit_glmnet_wf.rds")
# file.remove(here::here("R/fit_glmnet_wf.rds"))
fit_glmnet_wf|> 
  collect_metrics()|>
  ggplot(aes(mixture, mean, color = penalty, group = penalty)) +
  geom_line(size = 1.5, alpha = 0.6) +
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2) 
Warning: Removed 737 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 737 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
best_mod <- fit_glmnet_wf|> 
  select_best(metric = "rsq")

final_wf <- finalize_workflow(wf_glmnet, best_mod)
final_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
total_donations ~ gender + class_age * class_year + class_year * 
    first_donation_year

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 1
  mixture = 0

Computational engine: glmnet 
Code
final_fitted <- last_fit(final_wf, split = data_split)

final_fitted|> 
  collect_metrics() 
Code
predictions <- final_fitted|> 
  collect_predictions()|> 
  select(row = .row, total_donations, predicted = .pred)|> 
  mutate(
    delta = total_donations - predicted,
    delta_percent = scales::percent(delta / predicted, accuracy = .1)
  )
predictions
Code
final <- fit(final_wf, data = data_train)

final|> 
  tidy()|> 
  filter(abs(estimate) < 1E2, abs(estimate) > 1E-3)|> 
  ggplot(aes(estimate, term)) +
  geom_col()
Warning: package 'glmnet' was built under R version 4.5.1
Warning: package 'Matrix' was built under R version 4.5.1

Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':

    expand, pack, unpack
Loaded glmnet 4.1-10
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
# Residui su test dal last_fit (già calcolato)
pred_glmnet_te <- final_fitted |>
  collect_predictions() |>
  transmute(
    model = "glmnet",
    y  = total_donations,
    mu = .pred,
    resid = y - mu
  )

# Pattern residui vs predizione
pred_glmnet_te |>
  ggplot(aes(mu, resid)) +
  geom_point(alpha = .25) +
  geom_smooth(se = FALSE, color = state_cols[1]) +
  labs(x = expression(hat(mu)), y = "Residuo (y - μ̂)",
       title = "GLMNET — Residui vs predizione")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
# Sparsità e importanza (stima finale su training)
final_fit_tr <- fit(final_wf, data = data_train)

# Numero coefficienti non nulli (esclusa intercetta)
coef_tbl <- tidy(final_fit_tr) |>
  filter(term != "(Intercept)")

n_nonzero <- sum(coef_tbl$estimate != 0)
n_nonzero
[1] 31
Code
# Top effetti per magnitudine
coef_tbl |>
  mutate(abs_est = abs(estimate)) |>
  slice_max(abs_est, n = 20) |>
  ggplot(aes(x = abs_est, y = reorder(term, abs_est), fill = estimate > 0)) +
  geom_col(show.legend = FALSE) +
  labs(x = "|stima|", y = NULL, title = "GLMNET — Top 20 coefficienti per magnitudine")
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database

Code
# Metriche su test (coerenti con i GLM)
perf_glmnet <- pred_glmnet_te |>
  summarise(
    rmse = yardstick::rmse_vec(truth = y, estimate = mu),
    mae  = yardstick::mae_vec(truth = y, estimate = mu),
    rsq  = yardstick::rsq_vec(truth = y, estimate = mu)
  )
perf_glmnet

DE LUCA ERIK, P.IVA: IT01401250327
Sede legale: Via dei Giardini, 50 - 34146 - Trieste

Copyright 2025, Erik De Luca

Cookie Preferences

This website is built with , , and Quarto