Data import and joining
March 3, 2025

import os
import pandas as pd
Starting from the .xlsx files, let’s convert them to .csv:
# Setting directories containing data:
input_folder = "../../DATA/XLSX"
output_folder = "../../DATA/CSV"

# Creating output directory if it doesn't exist:
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

# Iterating over each file in the input directory:
for filename in os.listdir(input_folder):
    if filename.endswith(".xlsx"):
        # Creating full path to the xlsx file:
        xlsx_path = os.path.join(input_folder, filename)
        # Reading the xlsx file:
        try:
            df = pd.read_excel(xlsx_path, engine="openpyxl")
        except Exception as e:
            print(f"Could not read {filename}: {e}")
            continue
        # Changing the filename extension to .csv:
        csv_filename = filename.rsplit(".", 1)[0] + ".csv"
        csv_path = os.path.join(output_folder, csv_filename)
        # Writing the dataframe to a csv file:
        try:
            df.to_csv(csv_path, index=False)
            print(f"Converted {filename} to {csv_filename}")
        except Exception as e:
            print(f"Could not convert {filename}: {e}")

print("Conversion complete.")
Converted 2009-A riformattato.xlsx to 2009-A riformattato.csv
Converted 2010-A riformattato.xlsx to 2010-A riformattato.csv
Converted 2011-A riformattato.xlsx to 2011-A riformattato.csv
Converted 2012-A riformattato.xlsx to 2012-A riformattato.csv
Converted 2013-A riformattato.xlsx to 2013-A riformattato.csv
Converted 2014-A riformattato.xlsx to 2014-A riformattato.csv
Converted 2015-A riformattato.xlsx to 2015-A riformattato.csv
Converted 2016-A riformattato.xlsx to 2016-A riformattato.csv
Converted 2017-A riformattato.xlsx to 2017-A riformattato.csv
Converted 2018-A riformattato.xlsx to 2018-A riformattato.csv
Converted 2019-A riformattato.xlsx to 2019-A riformattato.csv
Converted 2020-A riformattato.xlsx to 2020-A riformattato.csv
Converted 2021-A riformattato.xlsx to 2021-A riformattato.csv
Converted 2022-A riformattato.xlsx to 2022-A riformattato.csv
Converted 2023-A riformattato.xlsx to 2023-A riformattato.csv
Conversion complete.
Now the file paths, with the correct extension, can be stored in a global variable so they can be reused whenever needed; the directory containing the files is stored as well:
# Paths to the converted CSV files:
file_paths = [
    "../../DATA/CSV/2009-A riformattato.csv",
    "../../DATA/CSV/2010-A riformattato.csv",
    "../../DATA/CSV/2011-A riformattato.csv",
    "../../DATA/CSV/2012-A riformattato.csv",
    "../../DATA/CSV/2013-A riformattato.csv",
    "../../DATA/CSV/2014-A riformattato.csv",
    "../../DATA/CSV/2015-A riformattato.csv",
    "../../DATA/CSV/2016-A riformattato.csv",
    "../../DATA/CSV/2017-A riformattato.csv",
    "../../DATA/CSV/2018-A riformattato.csv",
    "../../DATA/CSV/2019-A riformattato.csv",
    "../../DATA/CSV/2020-A riformattato.csv",
    "../../DATA/CSV/2021-A riformattato.csv",
    "../../DATA/CSV/2022-A riformattato.csv",
    "../../DATA/CSV/2023-A riformattato.csv",
]

# Directory containing the CSV files:
dir = "../../DATA/CSV"
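The same list could also be built programmatically instead of being typed by hand; a minimal sketch (an alternative, not part of the original workflow) using the standard glob module:

import glob

# Collecting every CSV produced by the conversion step, in chronological order
# (the filenames start with the year, so lexicographic order is chronological):
file_paths = sorted(glob.glob(os.path.join(dir, "*.csv")))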
Next, it’s important to make the column names uniform across the files:
import pandas as pd

# Defining uniform column names:
uniform_column_mapping = {
    "classe_donatore": "donor_class",
    "tipo": "donation_type",
    "nascita": "birth_year",
    "coorte_nascita": "birth_cohort",
    "prima_donazione": "first_donation_year",
    "coorte_prima_donazione": "first_donation_cohort",
    "numero_donazioni": "number_of_donations",
    "sesso": "gender",
    "data": "year",
    "ETÀ": "age",
    "ETA'": "age",
    "UN": "unique_number",
}

cleaned_files = []

for file_path in file_paths:
    # Loading the data:
    data = pd.read_csv(file_path)
    # Dropping the first column regardless of its name:
    data.drop(data.columns[0], axis=1, inplace=True)
    # Renaming columns to uniform names:
    data.rename(columns=uniform_column_mapping, inplace=True)
    # Standardizing data types - ensuring numeric columns are in the right data format:
    data["birth_year"] = data["birth_year"].astype(int, errors="ignore")
    data["birth_cohort"] = data["birth_cohort"].astype(int, errors="ignore")
    data["first_donation_year"] = data["first_donation_year"].astype(int, errors="ignore")
    data["first_donation_cohort"] = data["first_donation_cohort"].astype(int, errors="ignore")
    data["number_of_donations"] = data["number_of_donations"].astype(int, errors="ignore")
    # Calculating missing 'age' data if possible:
    data["age"] = data.apply(
        lambda row: (
            row["year"] - row["birth_year"]
            if pd.isna(row["age"]) and row["birth_year"] > 0
            else row["age"]
        ),
        axis=1,
    )
    # Saving the cleaned data:
    data.to_csv(file_path, index=False)
    cleaned_files.append(file_path)

for file in cleaned_files:
    print(f"Cleaned and saved: {file}")
Cleaned and saved: ../../DATA/CSV/2009-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2010-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2011-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2012-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2013-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2014-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2015-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2016-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2017-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2018-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2019-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2020-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2021-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2022-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2023-A riformattato.csv
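As a quick sanity check (an extra step, not in the original pipeline), it’s possible to verify that every cleaned file now exposes the same column names:

# Reading only the header row of each file and comparing it against the first one:
reference_columns = list(pd.read_csv(file_paths[0], nrows=0).columns)
for file_path in file_paths[1:]:
    columns = list(pd.read_csv(file_path, nrows=0).columns)
    if columns != reference_columns:
        print(f"Column mismatch in {file_path}: {columns}")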
Now that the files are all cleaned and consistently formatted, it’s possible to combine them into a single dataframe using pandas.
# Creating a dataframe for each individual file:
dataframes = []

# Iterating over each file in the directory:
for file in file_paths:
    if file.endswith(".csv"):
        # Creating the dataframe for the current file:
        df = pd.read_csv(file)
        # Appending the dataframe to the list:
        dataframes.append(df)

# Creating the final dataframe by concatenating all the individual dataframes:
combined_dataframe = pd.concat(dataframes, ignore_index=True)

print("Displaying the first few rows of the combined DataFrame:")
print(combined_dataframe.head())
print("\nSummary Information about the Combined DataFrame:")
print(combined_dataframe.info())
print("\nDisplaying a random sample of 10 rows:")
print(combined_dataframe.sample(10))
Displaying the first few rows of the combined DataFrame:
donor_class donation_type birth_year birth_cohort first_donation_year \
0 P PLASMAF NaN NaN 1998.0
1 O SANGUE 1980.0 1980.0 2009.0
2 P PLASMAF 1980.0 1980.0 2005.0
3 P PLASMAF 1980.0 1980.0 2005.0
4 P PLASMAF 1980.0 1980.0 2005.0
first_donation_cohort number_of_donations gender year age \
0 1995.0 1.0 M 2009.0 NaN
1 2005.0 1.0 F 2009.0 29.0
2 2005.0 5.0 M 2009.0 29.0
3 2005.0 5.0 M 2009.0 29.0
4 2005.0 5.0 M 2009.0 29.0
unique_number
0 27128310.0
1 26934057.0
2 26826075.0
3 26826075.0
4 26826075.0
Summary Information about the Combined DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320734 entries, 0 to 320733
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 donor_class 318762 non-null object
1 donation_type 318797 non-null object
2 birth_year 318795 non-null float64
3 birth_cohort 318795 non-null float64
4 first_donation_year 268554 non-null float64
5 first_donation_cohort 268554 non-null float64
6 number_of_donations 318797 non-null float64
7 gender 318797 non-null object
8 year 318797 non-null float64
9 age 319616 non-null float64
10 unique_number 319618 non-null float64
dtypes: float64(8), object(3)
memory usage: 26.9+ MB
None
Displaying a random sample of 10 rows:
donor_class donation_type birth_year birth_cohort \
270899 P PLASMAF 1968.0 1965.0
287670 P PLASMAF 1962.0 1960.0
178699 P SANGUE 1972.0 1970.0
66666 O SANGUE 1963.0 1960.0
110154 P SANGUE 1984.0 1980.0
264094 P PLASMAF 1975.0 1975.0
28006 P SANGUE 1964.0 1960.0
157319 P SANGUE 1968.0 1965.0
93977 P SANGUE 1971.0 1970.0
165119 P PLASMAF 1957.0 1955.0
first_donation_year first_donation_cohort number_of_donations \
270899 2015.0 2015.0 7.0
287670 1994.0 1990.0 3.0
178699 2008.0 2005.0 4.0
66666 2011.0 2010.0 1.0
110154 2004.0 2000.0 1.0
264094 2015.0 2015.0 3.0
28006 NaN NaN 2.0
157319 2001.0 2000.0 1.0
93977 2004.0 2000.0 3.0
165119 1984.0 1980.0 3.0
gender year age unique_number
270899 M 2021.0 53.0 27062223.0
287670 M 2022.0 60.0 26463408.0
178699 M 2016.0 44.0 26901579.0
66666 M 2011.0 48.0 26982792.0
110154 M 2013.0 29.0 26802666.0
264094 M 2020.0 45.0 26642463.0
28006 M 2010.0 46.0 26456958.0
157319 M 2015.0 47.0 26821704.0
93977 M 2012.0 41.0 26807010.0
165119 M 2016.0 59.0 26457501.0
Rows with missing values are usually dropped from the dataset; some values, however, can be calculated indirectly from other variables (e.g. “age”, which follows from the donation year and the birth year).
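The chunk producing the two summaries below is not shown; what follows is a minimal sketch of what it might look like, reusing combined_dataframe from the previous step (since “age” was already recomputed file by file during cleaning, only the drop remains):

# Inspecting the combined dataframe before dropping incomplete rows:
print(combined_dataframe.info())

# Dropping every row that still contains missing values:
combined_dataframe = combined_dataframe.dropna()
print("\nSummary after dropping rows with missing values:")
print(combined_dataframe.info())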
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320734 entries, 0 to 320733
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 donor_class 318762 non-null object
1 donation_type 318797 non-null object
2 birth_year 318795 non-null float64
3 birth_cohort 318795 non-null float64
4 first_donation_year 268554 non-null float64
5 first_donation_cohort 268554 non-null float64
6 number_of_donations 318797 non-null float64
7 gender 318797 non-null object
8 year 318797 non-null float64
9 age 319616 non-null float64
10 unique_number 319618 non-null float64
dtypes: float64(8), object(3)
memory usage: 26.9+ MB
None
Summary after dropping rows with missing values:
<class 'pandas.core.frame.DataFrame'>
Index: 268530 entries, 1 to 320733
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 donor_class 268530 non-null object
1 donation_type 268530 non-null object
2 birth_year 268530 non-null float64
3 birth_cohort 268530 non-null float64
4 first_donation_year 268530 non-null float64
5 first_donation_cohort 268530 non-null float64
6 number_of_donations 268530 non-null float64
7 gender 268530 non-null object
8 year 268530 non-null float64
9 age 268530 non-null float64
10 unique_number 268530 non-null float64
dtypes: float64(8), object(3)
memory usage: 24.6+ MB
None
Great. Now that the dataframe is ready to be used, it’s possible to save it to a single .csv file for further analysis.
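A minimal sketch of that final step (the output file name is a placeholder, not taken from the original):

# Writing the combined, cleaned dataframe to a single CSV file:
combined_dataframe.to_csv("../../DATA/CSV/combined_donations.csv", index=False)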