Data import and joining
March 3, 2025

import os
import pandas as pd
Starting from the .xlsx files, let’s convert them to .csv:
# Setting directories containing data:
input_folder = "../../DATA/XLSX"
output_folder = "../../DATA/CSV"

# Creating output directory if it doesn't exist:
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

# Iterating over each file in the input directory:
for filename in os.listdir(input_folder):
    if filename.endswith(".xlsx"):
        # Creating full path to the xlsx file:
        xlsx_path = os.path.join(input_folder, filename)
        # Reading the xlsx file:
        try:
            df = pd.read_excel(xlsx_path, engine="openpyxl")
        except Exception as e:
            print(f"Could not read {filename}: {e}")
            continue
        # Changing the filename extension to .csv:
        csv_filename = filename.rsplit(".", 1)[0] + ".csv"
        csv_path = os.path.join(output_folder, csv_filename)
        # Writing the dataframe to a csv file:
        try:
            df.to_csv(csv_path, index=False)
            print(f"Converted {filename} to {csv_filename}")
        except Exception as e:
            print(f"Could not convert {filename}: {e}")

print("Conversion complete.")
Converted 2009-A riformattato.xlsx to 2009-A riformattato.csv
Converted 2010-A riformattato.xlsx to 2010-A riformattato.csv
Converted 2011-A riformattato.xlsx to 2011-A riformattato.csv
Converted 2012-A riformattato.xlsx to 2012-A riformattato.csv
Converted 2013-A riformattato.xlsx to 2013-A riformattato.csv
Converted 2014-A riformattato.xlsx to 2014-A riformattato.csv
Converted 2015-A riformattato.xlsx to 2015-A riformattato.csv
Converted 2016-A riformattato.xlsx to 2016-A riformattato.csv
Converted 2017-A riformattato.xlsx to 2017-A riformattato.csv
Converted 2018-A riformattato.xlsx to 2018-A riformattato.csv
Converted 2019-A riformattato.xlsx to 2019-A riformattato.csv
Converted 2020-A riformattato.xlsx to 2020-A riformattato.csv
Converted 2021-A riformattato.xlsx to 2021-A riformattato.csv
Converted 2022-A riformattato.xlsx to 2022-A riformattato.csv
Converted 2023-A riformattato.xlsx to 2023-A riformattato.csv
Conversion complete.
Now the file paths, with the correct extension, can be stored in a global variable so they can be reused whenever needed; the directory containing the files is stored as well:
# Paths to the converted CSV files:
file_paths = [
    "../../DATA/CSV/2009-A riformattato.csv",
    "../../DATA/CSV/2010-A riformattato.csv",
    "../../DATA/CSV/2011-A riformattato.csv",
    "../../DATA/CSV/2012-A riformattato.csv",
    "../../DATA/CSV/2013-A riformattato.csv",
    "../../DATA/CSV/2014-A riformattato.csv",
    "../../DATA/CSV/2015-A riformattato.csv",
    "../../DATA/CSV/2016-A riformattato.csv",
    "../../DATA/CSV/2017-A riformattato.csv",
    "../../DATA/CSV/2018-A riformattato.csv",
    "../../DATA/CSV/2019-A riformattato.csv",
    "../../DATA/CSV/2020-A riformattato.csv",
    "../../DATA/CSV/2021-A riformattato.csv",
    "../../DATA/CSV/2022-A riformattato.csv",
    "../../DATA/CSV/2023-A riformattato.csv",
]

# Directory containing the CSV files:
dir = "../../DATA/CSV"
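The same list could also be built programmatically instead of being typed by hand; a minimal sketch (an alternative, not part of the original workflow) using the standard glob module:

import glob

# Collecting every CSV produced by the conversion step, in chronological order
# (the filenames start with the year, so lexicographic order is chronological):
file_paths = sorted(glob.glob(os.path.join(dir, "*.csv")))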
Next, it’s important to make the column names uniform across the files:
import pandas as pd

# Defining uniform column names:
uniform_column_mapping = {
    "classe_donatore": "donor_class",
    "tipo": "donation_type",
    "nascita": "birth_year",
    "coorte_nascita": "birth_cohort",
    "prima_donazione": "first_donation_year",
    "coorte_prima_donazione": "first_donation_cohort",
    "numero_donazioni": "number_of_donations",
    "sesso": "gender",
    "data": "year",
    "ETÀ": "age",
    "ETA'": "age",
    "UN": "unique_number",
}

cleaned_files = []

for file_path in file_paths:
    # Loading the data:
    data = pd.read_csv(file_path)
    # Dropping the first column regardless of its name:
    data.drop(data.columns[0], axis=1, inplace=True)
    # Renaming columns to uniform names:
    data.rename(columns=uniform_column_mapping, inplace=True)
    # Standardizing data types - ensuring numeric columns are in the right data format:
    data["birth_year"] = data["birth_year"].astype(int, errors="ignore")
    data["birth_cohort"] = data["birth_cohort"].astype(int, errors="ignore")
    data["first_donation_year"] = data["first_donation_year"].astype(int, errors="ignore")
    data["first_donation_cohort"] = data["first_donation_cohort"].astype(int, errors="ignore")
    data["number_of_donations"] = data["number_of_donations"].astype(int, errors="ignore")
    # Calculating missing 'age' data if possible:
    data["age"] = data.apply(
        lambda row: (
            row["year"] - row["birth_year"]
            if pd.isna(row["age"]) and row["birth_year"] > 0
            else row["age"]
        ),
        axis=1,
    )
    # Saving the cleaned data:
    data.to_csv(file_path, index=False)
    cleaned_files.append(file_path)

for file in cleaned_files:
    print(f"Cleaned and saved: {file}")
Cleaned and saved: ../../DATA/CSV/2009-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2010-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2011-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2012-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2013-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2014-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2015-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2016-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2017-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2018-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2019-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2020-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2021-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2022-A riformattato.csv
Cleaned and saved: ../../DATA/CSV/2023-A riformattato.csv
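As a quick sanity check (an extra step, not in the original pipeline), it’s possible to verify that every cleaned file now exposes the same column names:

# Reading only the header row of each file and comparing it against the first one:
reference_columns = list(pd.read_csv(file_paths[0], nrows=0).columns)
for file_path in file_paths[1:]:
    columns = list(pd.read_csv(file_path, nrows=0).columns)
    if columns != reference_columns:
        print(f"Column mismatch in {file_path}: {columns}")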
Now that the files are all cleaned and consistently formatted, it’s possible to combine them into a single dataframe using pandas.
# Creating a dataframe for each individual file:
dataframes = []

# Iterating over each file in the directory:
for file in file_paths:
    if file.endswith(".csv"):
        # Creating the dataframe for the current file:
        df = pd.read_csv(file)
        # Appending the dataframe to the list:
        dataframes.append(df)

# Creating the final dataframe by concatenating all the individual dataframes:
combined_dataframe = pd.concat(dataframes, ignore_index=True)

print("Displaying the first few rows of the combined DataFrame:")
print(combined_dataframe.head())
print("\nSummary Information about the Combined DataFrame:")
print(combined_dataframe.info())
print("\nDisplaying a random sample of 10 rows:")
print(combined_dataframe.sample(10))
Displaying the first few rows of the combined DataFrame:
donor_class donation_type birth_year birth_cohort first_donation_year \
0 P PLASMAF NaN NaN 1998.0
1 O SANGUE 1980.0 1980.0 2009.0
2 P PLASMAF 1980.0 1980.0 2005.0
3 P PLASMAF 1980.0 1980.0 2005.0
4 P PLASMAF 1980.0 1980.0 2005.0
first_donation_cohort number_of_donations gender year age \
0 1995.0 1.0 M 2009.0 NaN
1 2005.0 1.0 F 2009.0 29.0
2 2005.0 5.0 M 2009.0 29.0
3 2005.0 5.0 M 2009.0 29.0
4 2005.0 5.0 M 2009.0 29.0
unique_number
0 27128310.0
1 26934057.0
2 26826075.0
3 26826075.0
4 26826075.0
Summary Information about the Combined DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320734 entries, 0 to 320733
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 donor_class 318762 non-null object
1 donation_type 318797 non-null object
2 birth_year 318795 non-null float64
3 birth_cohort 318795 non-null float64
4 first_donation_year 268554 non-null float64
5 first_donation_cohort 268554 non-null float64
6 number_of_donations 318797 non-null float64
7 gender 318797 non-null object
8 year 318797 non-null float64
9 age 319616 non-null float64
10 unique_number 319618 non-null float64
dtypes: float64(8), object(3)
memory usage: 26.9+ MB
None
Displaying a random sample of 10 rows:
donor_class donation_type birth_year birth_cohort \
270899 P PLASMAF 1968.0 1965.0
287670 P PLASMAF 1962.0 1960.0
178699 P SANGUE 1972.0 1970.0
66666 O SANGUE 1963.0 1960.0
110154 P SANGUE 1984.0 1980.0
264094 P PLASMAF 1975.0 1975.0
28006 P SANGUE 1964.0 1960.0
157319 P SANGUE 1968.0 1965.0
93977 P SANGUE 1971.0 1970.0
165119 P PLASMAF 1957.0 1955.0
first_donation_year first_donation_cohort number_of_donations \
270899 2015.0 2015.0 7.0
287670 1994.0 1990.0 3.0
178699 2008.0 2005.0 4.0
66666 2011.0 2010.0 1.0
110154 2004.0 2000.0 1.0
264094 2015.0 2015.0 3.0
28006 NaN NaN 2.0
157319 2001.0 2000.0 1.0
93977 2004.0 2000.0 3.0
165119 1984.0 1980.0 3.0
gender year age unique_number
270899 M 2021.0 53.0 27062223.0
287670 M 2022.0 60.0 26463408.0
178699 M 2016.0 44.0 26901579.0
66666 M 2011.0 48.0 26982792.0
110154 M 2013.0 29.0 26802666.0
264094 M 2020.0 45.0 26642463.0
28006 M 2010.0 46.0 26456958.0
157319 M 2015.0 47.0 26821704.0
93977 M 2012.0 41.0 26807010.0
165119 M 2016.0 59.0 26457501.0
Rows with missing values are usually dropped from the dataset; some values, however, can be calculated indirectly from other variables (e.g. “age”, which follows from the donation year and the birth year).
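The chunk producing the two summaries below is not shown; what follows is a minimal sketch of what it might look like, reusing combined_dataframe from the previous step (since “age” was already recomputed file by file during cleaning, only the drop remains):

# Inspecting the combined dataframe before dropping incomplete rows:
print(combined_dataframe.info())

# Dropping every row that still contains missing values:
combined_dataframe = combined_dataframe.dropna()
print("\nSummary after dropping rows with missing values:")
print(combined_dataframe.info())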
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320734 entries, 0 to 320733
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 donor_class 318762 non-null object
1 donation_type 318797 non-null object
2 birth_year 318795 non-null float64
3 birth_cohort 318795 non-null float64
4 first_donation_year 268554 non-null float64
5 first_donation_cohort 268554 non-null float64
6 number_of_donations 318797 non-null float64
7 gender 318797 non-null object
8 year 318797 non-null float64
9 age 319616 non-null float64
10 unique_number 319618 non-null float64
dtypes: float64(8), object(3)
memory usage: 26.9+ MB
None
Summary after dropping rows with missing values:
<class 'pandas.core.frame.DataFrame'>
Index: 268530 entries, 1 to 320733
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 donor_class 268530 non-null object
1 donation_type 268530 non-null object
2 birth_year 268530 non-null float64
3 birth_cohort 268530 non-null float64
4 first_donation_year 268530 non-null float64
5 first_donation_cohort 268530 non-null float64
6 number_of_donations 268530 non-null float64
7 gender 268530 non-null object
8 year 268530 non-null float64
9 age 268530 non-null float64
10 unique_number 268530 non-null float64
dtypes: float64(8), object(3)
memory usage: 24.6+ MB
None
Great. Now that the dataframe is ready to be used, it’s possible to save it to a single .csv file for further analysis.
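A minimal sketch of that final step (the output file name is a placeholder, not taken from the original):

# Writing the combined, cleaned dataframe to a single CSV file:
combined_dataframe.to_csv("../../DATA/CSV/combined_donations.csv", index=False)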