Prediction
My model predicts whether or not a passenger will purchase travel insurance.
Context
A tour and travel company is offering a travel insurance package to its customers. The new insurance package also includes Covid cover. The company needs to know which customers would be interested in buying it, based on its database history. The insurance was offered to some of the customers in 2019, and the given data has been extracted from the performance/sales of the package during that period. The data covers almost 2000 of its previous customers, and you are required to build an intelligent model that can predict whether a customer will be interested in buying the travel insurance package based on the parameters given below.
Inspiration
The solution you offer may be used for customer-specific advertising of the package. Exploratory data analysis performed on the data would help find interesting insights. Predict whether a given customer would like to buy the insurance package once the corona lockdown ends and travelling resumes. Your work could probably help save thousands of rupees for a family.
- Age - Age of the customer
- Employment Type - The sector in which the customer is employed
- GraduateOrNot - Whether the customer is college graduate or not
- AnnualIncome - The yearly income of the customer in Indian rupees [rounded to nearest 50 thousand rupees]
- FamilyMembers - Number of members in customer's family
- ChronicDiseases - Whether the customer suffers from any major disease or condition such as diabetes, high BP, asthma, etc.
- FrequentFlyer - Derived data based on customer's history of booking air tickets on at least 4 different instances in the last 2 years [2017-2019].
- EverTravelledAbroad - Has the customer ever travelled to a foreign country [not necessarily using the company's services]
- TravelInsurance - Whether the customer bought the travel insurance package during the introductory offering held in 2019.
Wrangling and EDA
The classes in my data were not perfectly balanced, but close. A little more than 60% of passengers chose not to purchase insurance with their travel package, and a little less than 40% chose to include it. In my wrangle function, I dropped any high-cardinality columns and any columns that were not useful for prediction or that leaked information about my target vector (the [TravelInsurance] column). I performed some exploratory data analysis to find out what type of data I was dealing with in each column by running 'df.info()', and to see the overall distribution of each column with 'df.describe()'. Additionally, there are zero null values in this dataset.
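The wrangling and EDA steps described here might look roughly like the sketch below. The rows are made up purely for illustration, and the dropped "Unnamed: 0" column is a hypothetical example of a column with no predictive value; the post does not say which columns were actually dropped.

```python
import pandas as pd

# Made-up sample rows using some of the column names listed above.
# "Unnamed: 0" is a hypothetical leftover index column.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "Age": [31, 34, 28, 25, 33],
    "AnnualIncome": [400000, 1250000, 500000, 700000, 1400000],
    "FamilyMembers": [6, 7, 4, 3, 8],
    "TravelInsurance": [0, 1, 0, 0, 1],
})


def wrangle(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without columns that carry no useful signal."""
    frame = frame.copy()
    # errors="ignore" keeps the function usable when the column is absent
    return frame.drop(columns=["Unnamed: 0"], errors="ignore")


df = wrangle(df)

df.info()             # dtype and non-null count per column
print(df.describe())  # summary statistics per numeric column

null_counts = df.isnull().sum()  # missing values per column
class_balance = df["TravelInsurance"].value_counts(normalize=True)
print(class_balance)  # share of each target class
```

On the real data, `class_balance` is where the roughly 60/40 split between non-buyers and buyers would show up.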
I went ahead and set up my "X" matrix and "y" vector. Then I split the dataset into training and validation sets.
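A minimal sketch of that setup, using scikit-learn's `train_test_split`. The 20% validation share and `random_state=42` are assumptions (the 398 validation rows in the classification report would be consistent with holding out about 20% of roughly 2000 customers), and the rows are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows; the real dataset has almost 2000 customers.
df = pd.DataFrame({
    "Age": [31, 34, 28, 25, 33, 29, 35, 27, 30, 26],
    "AnnualIncome": [400000, 1250000, 500000, 700000, 1400000,
                     600000, 1300000, 550000, 800000, 1150000],
    "FamilyMembers": [6, 7, 4, 3, 8, 5, 4, 2, 6, 5],
    "TravelInsurance": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

target = "TravelInsurance"
X = df.drop(columns=target)  # feature matrix
y = df[target]               # target vector

# Assumed split: 80% training, 20% validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_val.shape)
```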
Baseline Accuracy = 0.46
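One common way to get a baseline for a classification problem is to always predict the majority class and report its share of the training labels. The sketch below shows that approach on made-up labels; the post does not show how its own baseline was computed, so this may not be the same calculation.

```python
import pandas as pd

# Made-up training labels: six non-buyers (0) and four buyers (1)
y_train = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1], name="TravelInsurance")

# Accuracy obtained by always predicting the most frequent class
baseline_acc = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(baseline_acc, 2))
```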
Random Forest Modeling
I fit the model with the following hyperparameters:
n_estimators = 100
random_state = 42
max_depth = 5
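With scikit-learn, a random forest using these three settings could be fit as sketched below. The data here is synthetic, and the names `model_rf`, `X_train`, and `X_val` are illustrative assumptions rather than the original code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 4 numeric features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters as listed in the post
model_rf = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    max_depth=5,
)
model_rf.fit(X_train, y_train)

# Accuracy on the held-out validation set
val_acc = accuracy_score(y_val, model_rf.predict(X_val))
print("Validation accuracy:", val_acc)
```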
Accuracy Score of my Model: 0.84
My random forest model's accuracy score was much better than my baseline accuracy score, which is good. On average, the model is 84% accurate.
              precision    recall  f1-score   support

           0       0.82      0.92      0.87       257
           1       0.82      0.63      0.71       141

    accuracy                           0.82       398
   macro avg       0.82      0.78      0.79       398
weighted avg       0.82      0.82      0.81       398
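A report like this comes from scikit-learn's `classification_report`. In the table above, the recall of 0.63 for class 1 means the model caught about 63% of the customers who actually bought insurance, while the recall of 0.92 for class 0 means it rarely mislabeled non-buyers. The sketch below reproduces the mechanics on made-up labels and predictions.

```python
from sklearn.metrics import classification_report

# Made-up validation labels and predictions, for illustration only
y_val = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Text version, laid out like the table in the post
report = classification_report(y_val, y_pred)
print(report)

# Dictionary version, convenient for pulling out single numbers
report_dict = classification_report(y_val, y_pred, output_dict=True)
print("Recall for class 1:", report_dict["1"]["recall"])
```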
Gradient Boosting Classifier:
model_gbc Training Accuracy: 0.8432976714915041
model_gbc Validation Accuracy: 0.8417085427135679
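The training and validation scores above can be obtained with the `score` method of a fitted classifier. The sketch below assumes scikit-learn's `GradientBoostingClassifier` with default hyperparameters, since the post does not list the ones used, and runs on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Default hyperparameters are an assumption
model_gbc = GradientBoostingClassifier(random_state=42)
model_gbc.fit(X_train, y_train)

train_acc = model_gbc.score(X_train, y_train)
val_acc = model_gbc.score(X_val, y_val)
print("model_gbc Training Accuracy:", train_acc)
print("model_gbc Validation Accuracy:", val_acc)
```

Training and validation accuracy that sit close together, as in the figures above, suggest the model is not overfitting much.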
Feature Importance:
EverTravelledAbroad 0.013082
GraduateOrNot 0.013964
ChronicDiseases 0.039366
FrequentFlyer 0.040516
Age 0.114855
FamilyMembers 0.246217
AnnualIncome 0.532000
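The seven values above sum to 1, as the impurity-based importances of scikit-learn's tree ensembles do. A sorted listing like this can be produced by wrapping `feature_importances_` in a pandas Series, as sketched below on synthetic data. Which of the two models the listing came from is not stated in the post, so the random forest here is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data in which only AnnualIncome determines the target
rng = np.random.default_rng(0)
features = ["Age", "AnnualIncome", "FamilyMembers"]
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=features)
y = (X["AnnualIncome"] > 0).astype(int)

model = RandomForestClassifier(
    n_estimators=100, max_depth=5, random_state=42
).fit(X, y)

# Pair each importance with its column name, smallest first
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values()
print(importances)
```

Impurity-based importances show how much each feature was used to split the data, not the direction of its effect on the prediction.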
[Figure: Feature Importance]
[Figure: Correlation Between Annual Income and Insurance Purchase]
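One simple way to check the relationship between income and purchase numerically is the Pearson correlation between the two columns, plus the mean income per target class. The values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up incomes (in rupees) and purchase flags
df = pd.DataFrame({
    "AnnualIncome": [300000, 500000, 800000, 1200000, 1400000, 600000],
    "TravelInsurance": [0, 0, 0, 1, 1, 0],
})

# Pearson correlation between income and the 0/1 purchase flag
corr = df["AnnualIncome"].corr(df["TravelInsurance"])

# Average income of non-buyers (0) and buyers (1)
mean_income = df.groupby("TravelInsurance")["AnnualIncome"].mean()

print("Correlation:", corr)
print(mean_income)
```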
Conclusion
My model did better than the baseline, which I expected given the fairly even balance between the two values of my target vector (roughly 60% vs. 40%). I am still studying this dataset more deeply. I will probably need to one-hot encode my target and fit a linear or ridge regression to examine the problem in a somewhat different way. I would say that this project was not too bad as far as data wrangling and cleaning are concerned. I admit that I am still unsure about a few modeling techniques that usually take some trial and error. My model's accuracy compared well with a few recent scores from others who worked on the same dataset. Based on the feature importances, my model also suggested that passengers with higher incomes tend to buy insurance packages, and that older passengers are more likely to purchase insurance as well.