Insurance Claim Visualizations
Using a dataset of auto insurance claims from Kaggle, I visualized insights with the Python libraries pandas, Matplotlib and Seaborn. The analysis was done in JupyterLab.
Step 1: Importing and Cleaning
Using pandas to bring the dataset into JupyterLab and prepare it for analysis.
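
As a rough sketch, the import-and-clean step could look like the following. The inline CSV stands in for the Kaggle file, and the column names, the sample values and the specific cleaning choices (dropping duplicates and rows with no loss amount) are assumptions for illustration, not the dataset's actual schema or the exact steps taken.

```python
import io

import pandas as pd

# Stand-in for the Kaggle CSV; the column names and values here are invented.
raw_csv = io.StringIO(
    "months_as_customer,age,incident_date,total_claim_amount\n"
    "328,48,2015-01-25,71610\n"
    "228,42,2015-01-21,5070\n"
    "228,42,2015-01-21,5070\n"
    "134,29,2015-02-22,\n"
)

# In a notebook, raw_csv would be replaced by the path to the downloaded file.
claims = pd.read_csv(raw_csv, parse_dates=["incident_date"])

# Basic cleaning: drop exact duplicate rows and rows missing the loss amount.
claims = claims.drop_duplicates().dropna(subset=["total_claim_amount"])
claims = claims.reset_index(drop=True)

print(claims.dtypes)
print(claims.shape)
```

Parsing the date column at load time keeps later time-based grouping simple.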

I started with some general summary statistics before building any visualizations: the mean, standard deviation, min and max of each column. I wanted to understand the claims data at a high level before letting my curiosity lead me to the visualizations in Step 2.
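
A minimal sketch of that kind of high-level summary, using pandas' `describe()`. The rows and column names below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented sample rows; column names are assumptions, not the dataset's schema.
claims = pd.DataFrame(
    {
        "age": [19, 31, 38, 45, 64],
        "annual_premium": [1100.0, 1250.0, 1300.0, 1180.0, 1455.0],
        "vehicles_involved": [1, 3, 1, 2, 1],
    }
)

# describe() reports count, mean, std, min, quartiles and max per numeric column;
# here only the four statistics of interest are kept.
summary = claims.describe().loc[["mean", "std", "min", "max"]]
print(summary.round(1))
```
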
One column recorded "months with carrier", meaning how long an insured has held a policy with this company. It had a very high max and a large standard deviation, so I removed values more than two standard deviations from the mean, which gave a mean "months with carrier" of 191 (just under 16 years). Many insurance carriers see a typical policy lifecycle of about 7 years, so this is a really good average even with the outliers removed.
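
One way to read that outlier rule is "keep only values within two standard deviations of the mean". The sketch below applies that reading to invented tenure values; the numbers, the series name and the exact cutoff are assumptions.

```python
import pandas as pd

# Invented tenure values in months; 2000 is a deliberately extreme entry.
tenure = pd.Series(
    [150, 160, 170, 180, 185, 190, 195, 200, 205, 210, 220, 2000],
    name="months_with_carrier",
)

mean, std = tenure.mean(), tenure.std()

# Keep values within two standard deviations of the mean (one reading of the rule).
within_two_std = tenure[(tenure - mean).abs() <= 2 * std]

print(f"mean before: {mean:.0f} months, after: {within_two_std.mean():.0f} months")
print(f"after, in years: {within_two_std.mean() / 12:.1f}")
```

Because the mean and standard deviation are themselves pulled by extreme values, a median-based rule is a common alternative when outliers are severe.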
The mean age of a policyholder is 38, with a min of 19 and a max of 64. This is a favorable spread of ages. Typically you'd want your age groups to be over 18 and under 70.
Mean premium is $1,257 and the mean number of vehicles involved in a claim (remember, this is claims data) is just 1.8.

Step 2: Visualizations
Using pandas, Matplotlib and Seaborn.
My first visualizations were a way of analyzing demographics. I wanted to know whether sex, education, occupation or relationship status would indicate higher or lower losses. Unsurprisingly, there were no trends here: the group averages were all within one standard deviation of each other.
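
The "within one standard deviation" check can be sketched numerically with a `groupby`. The column names and values are invented, and comparing the gap between group means against the overall standard deviation is one possible way to frame the comparison, not necessarily the one used here.

```python
import pandas as pd

# Invented rows; the column names are assumptions for illustration.
claims = pd.DataFrame(
    {
        "insured_sex": ["F", "M", "F", "M", "F", "M"],
        "total_claim_amount": [52000, 48000, 61000, 55000, 47000, 58000],
    }
)

# Mean loss per group, next to the overall spread of losses.
group_means = claims.groupby("insured_sex")["total_claim_amount"].mean()
overall_std = claims["total_claim_amount"].std()

# Gap between the highest and lowest group mean, in units of standard deviation.
gap_in_std = (group_means.max() - group_means.min()) / overall_std
print(group_means)
print(f"largest gap between groups: {gap_in_std:.2f} standard deviations")
```

The same pattern works for education, occupation or relationship status by swapping the grouping column.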

Then I added a heat map to get a snapshot of the numeric correlations. Months as customer and age were closely correlated, which doesn't require much explanation. The separate claim categories were correlated, because if you have an injury, it's likely that there is also property damage. There was also a slight correlation between loss amount and the time of day that the accident occurred, as well as between the number of vehicles involved and the total loss amount. I was surprised that hour of day, vehicles involved and even age didn't have a stronger relationship with total losses.
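
A correlation heat map of this kind can be sketched with `DataFrame.corr()` and Seaborn's `heatmap`. The data below is synthetic, the column names are assumptions, and tenure is deliberately tied to age so that one strong correlation shows up.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with invented column names; only age and tenure are linked.
rng = np.random.default_rng(0)
age = rng.integers(19, 65, size=200)
claims = pd.DataFrame(
    {
        "age": age,
        "months_as_customer": (age - 18) * 10 + rng.normal(0, 20, size=200),
        "incident_hour": rng.integers(0, 24, size=200),
        "total_claim_amount": rng.normal(50_000, 15_000, size=200),
    }
)

# Pairwise Pearson correlations between the numeric columns.
corr = claims.corr()

# Annotated heat map; vmin/vmax pin the colour scale to the full -1..1 range.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation between numeric columns")
```
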

The months with carrier were converted to years for my histograms. They show a loyal customer base with tenures of 10 to 25 years and a typical age range of 30 to 45, meaning most of the customers started with the carrier in their early 20s and never left.
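
The months-to-years conversion and histogram could look roughly like this. The values, column names and the 5-year bin width are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented tenure values in months; the column name is an assumption.
claims = pd.DataFrame({"months_as_customer": [36, 120, 150, 191, 204, 240, 288, 300]})

# Convert months to years so the histogram bins are easier to read.
claims["years_as_customer"] = claims["months_as_customer"] / 12

fig, ax = plt.subplots()
ax.hist(claims["years_as_customer"], bins=range(0, 31, 5), edgecolor="black")
ax.set_xlabel("Years with carrier")
ax.set_ylabel("Number of policyholders")
```
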

A scatter plot was another way to show the correlation (or lack thereof) between age, number of vehicles involved, months as customer and total loss. This helped confirm what the heat map outlined.

I used three line graphs in this project. The first shows losses over time, which had a spike in February.
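
A losses-over-time line graph can be sketched by summing claim amounts per period and plotting the result. The monthly grouping, column names and values below are assumptions; the sample data is built so that February stands out.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented incidents; column names are assumptions.
claims = pd.DataFrame(
    {
        "incident_date": pd.to_datetime(
            ["2015-01-05", "2015-01-20", "2015-02-03", "2015-02-14", "2015-02-27", "2015-03-01"]
        ),
        "total_claim_amount": [40_000, 55_000, 60_000, 72_000, 65_000, 48_000],
    }
)

# Sum losses per calendar month.
month = claims["incident_date"].dt.to_period("M")
monthly = claims.groupby(month)["total_claim_amount"].sum()

fig, ax = plt.subplots()
ax.plot(monthly.index.astype(str), monthly.values, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Total losses ($)")
```
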

The next graph shows the difference between property and injury claims over time. It told me that injury claims were slightly higher than property claims and are trending slightly higher as well.

Vehicle claims followed a similar trend, but they bottom out around $37,000, whereas the injury and property claims average about $8,000. This tells us that damage to the vehicle itself presents the largest payout for this insurance carrier.

This box plot visualizes the distribution of claim payouts based on the number of witnesses at the scene of the accident. The graph shows minor differences but still confirms what the heat map already told us: the number of witnesses has no significant correlation with claim payouts.
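
A box plot like this can be sketched with Seaborn's `boxplot`. The data is synthetic and the payouts are drawn independently of the witness count, so the boxes should look alike; the column names are assumptions.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: payouts drawn independently of the witness count.
rng = np.random.default_rng(1)
claims = pd.DataFrame(
    {
        "witnesses": rng.integers(0, 4, size=400),
        "total_claim_amount": rng.normal(50_000, 15_000, size=400),
    }
)

# One box per witness count; similar medians suggest little relationship.
ax = sns.boxplot(data=claims, x="witnesses", y="total_claim_amount")
ax.set_title("Claim payout by number of witnesses")

# The same comparison as numbers rather than a picture.
medians = claims.groupby("witnesses")["total_claim_amount"].median()
print(medians.round(0))
```
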

The strip plot showed that single-vehicle losses were slightly more costly than multi-vehicle collisions. That can make sense, as there may be other parties at fault, and laws like PIP (personal injury protection) can protect the insurance carrier from paying out for damage to other vehicles or injuries to others.
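
For reference, a strip plot of payouts by number of vehicles can be sketched as follows. The data is synthetic and the column names are assumptions, so the plot will not reproduce the single-vehicle pattern described here.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with invented column names and values.
rng = np.random.default_rng(2)
claims = pd.DataFrame(
    {
        "vehicles_involved": rng.choice([1, 2, 3, 4], size=300),
        "total_claim_amount": rng.normal(50_000, 15_000, size=300).clip(min=0),
    }
)

# Each dot is one claim; jitter spreads overlapping points horizontally.
ax = sns.stripplot(
    data=claims, x="vehicles_involved", y="total_claim_amount", jitter=True, alpha=0.5
)
ax.set_xlabel("Vehicles involved")
ax.set_ylabel("Total claim amount ($)")
```
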

A joint plot was used to get granular on the time of day. It looks like accidents tend to happen in the late night/early morning.

Finally, you can see the claim count by state, as well as the total dollars paid out per state.
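
Both bar graphs can come from a single `groupby` that computes the count and the sum together. The state codes below are placeholders and the column names and amounts are invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows; "AA", "BB", "CC" are placeholder state codes.
claims = pd.DataFrame(
    {
        "incident_state": ["AA", "BB", "AA", "CC", "AA", "BB"],
        "total_claim_amount": [50_000, 62_000, 41_000, 58_000, 47_000, 53_000],
    }
)

# Claim count and total dollars paid out, per state.
by_state = claims.groupby("incident_state")["total_claim_amount"].agg(["count", "sum"])

fig, (ax_count, ax_sum) = plt.subplots(1, 2, figsize=(10, 4))
by_state["count"].plot.bar(ax=ax_count, title="Claims per state")
by_state["sum"].plot.bar(ax=ax_sum, title="Total paid out per state ($)")
fig.tight_layout()
```
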


The most interesting finding from my analysis didn't have to do with numeric correlations, but with the actual states that the losses occurred in.
All of these policies have garaging locations within the same three states, yet most of the accidents occur outside of those states. What the final bar graphs don't show is that there hasn't been a single loss in Illinois all year, even though one third of the policies are IL policies. This is a flag for deeper analysis because it could mean you have insurance fraud on your hands.
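
A check like this can be sketched by comparing each claim's policy state with its incident state. Apart from "IL", the state codes are placeholders, and the column names and rows are invented for illustration.

```python
import pandas as pd

# Invented rows; column names are assumptions and "XX", "YY", "ZZ", "WW"
# are placeholder state codes.
claims = pd.DataFrame(
    {
        "policy_state": ["IL", "XX", "YY", "IL", "XX", "YY"],
        "incident_state": ["ZZ", "XX", "ZZ", "WW", "ZZ", "YY"],
    }
)

# Flag claims where the loss happened outside the state the policy is garaged in.
claims["out_of_state"] = claims["policy_state"] != claims["incident_state"]
share_out_of_state = claims["out_of_state"].mean()

# Losses per incident state, reindexed so garaging states with zero losses still appear.
garaging_states = sorted(claims["policy_state"].unique())
losses_in_garaging_states = (
    claims["incident_state"].value_counts().reindex(garaging_states, fill_value=0)
)

print(f"out-of-state share: {share_out_of_state:.0%}")
print(losses_in_garaging_states)
```

The `reindex` step matters here: a plain `value_counts()` silently omits states with no losses, which is exactly the case worth surfacing.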