Travel Insurance Purchase Prediction

 



Prediction 

My model tries to predict whether or not a passenger will purchase travel insurance.


Context

A tour & travels company is offering travel insurance packages to its customers. The new insurance package also includes Covid cover. The company needs to know which customers would be interested in buying it, based on its database history. The insurance was offered to some of the customers in 2019, and the given data has been extracted from the performance/sales of the package during that period. The data is provided for almost 2000 of its previous customers, and you are required to build an intelligent model that can predict whether a customer will be interested in buying the travel insurance package based on the parameters given below.


Inspiration

The solution you offer may be used for customer-specific advertising of the package. Exploratory data analysis performed on the data would help find interesting insights. Predict whether a given customer would like to buy the insurance package once the corona lockdown ends and travelling resumes. Your work could probably help save thousands of rupees for a family.


About Data

  1. Age - Age of the customer
  1. Employment Type - The sector in which the customer is employed
  1. GraduateOrNot - Whether the customer is a college graduate or not
  1. AnnualIncome - The yearly income of the customer in Indian rupees [rounded to the nearest 50 thousand rupees]
  1. FamilyMembers - Number of members in the customer's family
  1. ChronicDiseases - Whether the customer suffers from any major disease or condition like diabetes, high BP, asthma, etc.
  1. FrequentFlyer - Derived data based on the customer's history of booking air tickets on at least 4 different instances in the last 2 years [2017-2019]
  1. EverTravelledAbroad - Whether the customer has ever travelled to a foreign country [not necessarily using the company's services]
  1. TravelInsurance - Whether the customer bought the travel insurance package during the introductory offering held in 2019


Wrangling and EDA

The target in my dataset was not perfectly balanced, but it was close: a little more than 60% of passengers chose not to purchase insurance with their travel package, and a little less than 40% chose to include it. In my wrangle function, I dropped any high-cardinality columns and any columns that were not useful for prediction or that leaked information about my target vector (the [TravelInsurance] column). I performed some exploratory data analysis to find out what type of data I am dealing with in each column by running 'df.info()', and to see the overall distribution of each column with 'df.describe()'. Additionally, there are zero null values in this dataset.
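A minimal sketch of what such a wrangle step and the checks described above might look like. The toy DataFrame, the invented `CustomerID` column, and the `max_cardinality` threshold are assumptions made for illustration, not my actual data or function.

```python
import pandas as pd


def wrangle(df, max_cardinality=50):
    """Hypothetical cleaning step: drop non-numeric columns with many distinct values."""
    df = df.copy()
    # Non-numeric columns with more distinct values than the threshold
    # (for example ID-like columns) are dropped.
    high_card = [
        col
        for col in df.select_dtypes(exclude="number").columns
        if df[col].nunique() > max_cardinality
    ]
    return df.drop(columns=high_card)


# Toy stand-in for the real data; "CustomerID" is an invented column.
toy = pd.DataFrame(
    {
        "CustomerID": ["a1", "b2", "c3", "d4", "e5"],
        "Employment Type": [
            "Government Sector",
            "Private Sector/Self Employed",
            "Private Sector/Self Employed",
            "Government Sector",
            "Private Sector/Self Employed",
        ],
        "Age": [25, 31, 34, 28, 29],
        "TravelInsurance": [0, 1, 0, 0, 1],
    }
)

clean = wrangle(toy, max_cardinality=3)  # tiny threshold because the toy data is tiny
clean.info()                             # dtypes and non-null counts per column
print(clean.describe())                  # summary statistics of the numeric columns
balance = clean["TravelInsurance"].value_counts(normalize=True)
print(balance)                           # share of each target class
n_missing = int(clean.isnull().sum().sum())
print("Missing values:", n_missing)
```

On the real data, the same `value_counts(normalize=True)` call is what would show the roughly 60/40 split of the target.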










I went ahead and set up my "X" matrix and "y" vector. Then I split the dataset into training and validation sets.
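A sketch of this step using scikit-learn's `train_test_split`. The synthetic data and the 80/20 split are assumptions; an 80/20 split would be consistent with the 398 validation rows out of almost 2000 customers, but the exact split settings are not recorded here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wrangled data (values are random, not real).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame(
    {
        "Age": rng.integers(25, 36, size=n),
        "AnnualIncome": rng.integers(6, 37, size=n) * 50_000,
        "FamilyMembers": rng.integers(2, 10, size=n),
        "TravelInsurance": rng.integers(0, 2, size=n),
    }
)

target = "TravelInsurance"
X = df.drop(columns=target)  # feature matrix
y = df[target]               # target vector

# Hold out 20% of the rows for validation (assumed ratio).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_val.shape)
```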





Baseline Accuracy = 0.46 (46%)
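For reference, one common way to get a baseline for a classifier is to always predict the most frequent class, so the baseline accuracy is that class's share of the labels. The sketch below uses ten toy labels and is not necessarily how the figure above was computed.

```python
import pandas as pd

# Toy labels: six non-buyers (0) and four buyers (1).
y_train = pd.Series([0] * 6 + [1] * 4, name="TravelInsurance")

# Share of the majority class = accuracy of always guessing that class.
baseline_acc = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", baseline_acc)
```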


Random Forest Modeling

I set the following hyperparameters for the model:

        n_estimators = 100

        random_state = 42

        max_depth = 5

Accuracy Score of my Model: 0.84 (84%)

My random forest model's accuracy score was much better than my baseline accuracy score, which is good: the model is correct about 84% of the time.
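A sketch of fitting a random forest with the hyperparameters listed above. The data comes from scikit-learn's `make_classification` and is purely synthetic, so the printed score will not match my result; the class weights are an assumption chosen to loosely imitate the imbalance of the target.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the travel dataset).
X, y = make_classification(
    n_samples=500,
    n_features=8,
    n_informative=4,
    weights=[0.64, 0.36],
    random_state=42,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same hyperparameters as listed in the post.
model_rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model_rf.fit(X_train, y_train)

# score() returns mean accuracy as a fraction between 0 and 1.
val_acc = model_rf.score(X_val, y_val)
print("Validation accuracy:", val_acc)
```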







Confusion Matrix

                  precision    recall  f1-score   support

               0       0.82      0.92      0.87       257
               1       0.82      0.63      0.71       141

        accuracy                           0.82       398
       macro avg       0.82      0.78      0.79       398
    weighted avg       0.82      0.82      0.81       398
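A report like the one above can be produced with scikit-learn's `classification_report`, and the confusion matrix itself with `confusion_matrix`. The ten labels below are made up so that the numbers can be checked by hand.

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Toy ground truth and predictions (invented for illustration).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Recall for class 1: buyers correctly identified / all actual buyers.
recall_buyers = recall_score(y_true, y_pred)
print("Recall (class 1):", recall_buyers)

print(classification_report(y_true, y_pred))
```

Read the same way, a recall of 0.63 on class 1 with a support of 141 means that roughly 89 of the 141 actual buyers in the validation set were identified.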






Gradient Boosting Classifier:

model_gbc Training Accuracy: 0.8432976714915041
model_gbc Validation Accuracy: 0.8417085427135679
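A sketch of how training and validation accuracy can be compared for a gradient boosting model. The data is synthetic and the default hyperparameters are an assumption, so the printed numbers will differ from the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the travel dataset).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model_gbc = GradientBoostingClassifier(random_state=42)
model_gbc.fit(X_train, y_train)

# Mean accuracy on the data the model saw and on held-out data.
train_acc = model_gbc.score(X_train, y_train)
val_acc = model_gbc.score(X_val, y_val)
print("model_gbc Training Accuracy:", train_acc)
print("model_gbc Validation Accuracy:", val_acc)
```

A large gap between the two scores would point to overfitting; in my results the two are nearly identical (about 0.843 vs 0.842).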





Feature Importance:

        EverTravelledAbroad    0.013082
        GraduateOrNot          0.013964
        ChronicDiseases        0.039366
        FrequentFlyer          0.040516
        Age                    0.114855
        FamilyMembers          0.246217
        AnnualIncome           0.532000
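A sketch of how a sorted importance listing like this can be built from a fitted tree ensemble's `feature_importances_` attribute. The data is synthetic, the target is constructed to depend only on income, and the choice of a random forest here is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features (random values, not the real data).
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame(
    {
        "Age": rng.integers(25, 36, size=n),
        "FamilyMembers": rng.integers(2, 10, size=n),
        "AnnualIncome": rng.integers(6, 37, size=n) * 50_000,
    }
)
# Synthetic target that depends only on income, so the ranking is easy to read.
y = (X["AnnualIncome"] > 1_300_000).astype(int)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X, y)

# Pair each importance with its column name and sort ascending.
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values()
print(importances)
```

The importances sum to 1 and say how much each feature contributed to the splits, not whether higher values push the prediction up or down; the direction has to be read from a plot or a correlation.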









Correlation Between Annual Income and Insurance Purchase





Conclusion

My model clearly did better than the baseline, which I saw coming because the two values of my target vector are reasonably balanced (~60% vs 40%). I am still studying this dataset more deeply. I will probably need to one-hot encode my categorical columns and fit a linear model (logistic regression or a ridge classifier) to examine the problem in a somewhat different way. I would say that this project was not too bad to begin with as far as data wrangling and cleaning are concerned. I admit that I am still unsure about a few modeling techniques that usually require some trial and error. My model's accuracy compared well with a few recent scores from others who had worked on the same dataset. Based on the feature importances, my model also suggested that passengers with high income tend to buy insurance packages and that older passengers are more likely to purchase insurance as well.





