Here we go again! The Titanic competition is back, and I’m determined to improve my score!
If you remember my previous post about this competition, the goal was to use machine learning to predict survival and death of a sample of passengers. The idea was to use information such as their gender and their ticket fare to calculate if they were more likely to survive or die in the disaster.
I started this analysis with a very simple prediction relying on gender, with a 76.55% accuracy. Then I used a Random Forest model with a few parameters and jumped to a 77.51% accuracy. I then worked on feature engineering with some success and some failures, and ended up with an accuracy of 78.95%. My goal today is to get as close to 80% as I can.
What I Did in the Meantime
Since I was stuck, I decided to learn a bit more about Machine Learning and how to work on these models. I completed the Intermediate Machine Learning course on Kaggle. I also tested a lot of new techniques on the Housing Price competition that is associated with this course.
And as I mentioned in the previous post, I also opened a discussion on Kaggle to get tips and advice from other Titanic competitors. This discussion gave me plenty of ideas for improving the score. This is what I’m going to try today!
The first small improvement I made in the dataset was to group the columns “SibSp” (number of siblings and spouses on board the Titanic) and “Parch” (number of parents and children on board the Titanic) into one column: “Family”.
It didn’t improve the accuracy but it was a way for me to tidy the data.
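As a minimal sketch of that tidy-up, assuming the two counts are simply added together (the tiny DataFrame below is made up for illustration; the real columns come from the Kaggle training file):

```python
import pandas as pd

# Tiny hand-made stand-in for the Titanic training data (illustrative values only).
train_data = pd.DataFrame({
    "SibSp": [1, 0, 3, 0],
    "Parch": [0, 0, 2, 1],
})

# Assumption: "Family" is the sum of siblings/spouses and parents/children on board.
train_data["Family"] = train_data["SibSp"] + train_data["Parch"]

# The two original columns are now redundant and can be dropped.
train_data = train_data.drop(columns=["SibSp", "Parch"])

print(train_data)
```

The same transformation would have to be applied to the test data before predicting.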
One of the tips that came back the most was to use “bins” for ages instead of the true value. Continuous variables such as the age of passengers can be a little bit of a problem for decision trees. As I understand it, the tree can treat “24” and “25” as very different, while they should actually be treated as similar.
To get around this problem, the simplest way is to group ages together. If you remember, this is what I did when I visualised the survival rate per age:
Grouping ages by decade seems the most straightforward approach. Unfortunately, when I add this information to the model, I once again get a percentage lower than when I don’t add it: 78.47%! However, there is an improvement compared to when I add the ages without grouping them.
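Grouping by decade can be sketched in one line with integer division. The ages below are made up, and the snippet assumes missing ages have already been filled in (the conversion to integers fails on missing values):

```python
import pandas as pd

# Illustrative ages only; assumes there are no missing values left in the column.
ages = pd.Series([0.42, 9, 10, 24, 25, 38, 80], name="Age")

# Integer division by 10 gives the decade: 24 and 25 both end up in decade 2 (ages 20 to 29).
age_decade = (ages // 10).astype(int)

print(age_decade.tolist())
```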
Looking at other notebooks (in particular the very helpful notebook by Sid2412), I decided to group the ages by “survivability”:
train_data['AgeGroup'] = pd.cut(train_data['Age'], 5)
print(train_data[['AgeGroup', 'Survived']].groupby('AgeGroup', as_index=False).mean().sort_values('Survived', ascending=False))
Basically the idea is to give a “weight” to each age group that corresponds to their chance of survival. I tinkered a little bit with how many groups would be ideal:
In the end, I decided to go with 5 groups (like in the example notebook linked above) because this is where I saw the biggest differences between the groups.
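Here is a rough sketch of how the “weight” idea could be coded, using made-up data. The column name `AgeWeight` and the use of `labels=False` and `retbins=True` are assumptions for the example, not necessarily what the notebook does:

```python
import pandas as pd

# Made-up ages and outcomes, only to show the mechanics.
train_data = pd.DataFrame({
    "Age":      [2, 5, 20, 25, 30, 35, 45, 50, 60, 80],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

# Cut the ages into 5 equal-width groups, numbered 0 to 4.
# retbins=True also returns the bin edges so they can be reused later.
age_group, age_edges = pd.cut(train_data["Age"], bins=5, labels=False, retbins=True)
train_data["AgeGroup"] = age_group

# Survival rate of each group, computed on the training data only.
rate_per_group = train_data.groupby("AgeGroup")["Survived"].mean()

# Replace each group number by the survival rate of that group (hypothetical column name).
train_data["AgeWeight"] = train_data["AgeGroup"].map(rate_per_group)

print(train_data)
```

For the test set, the same edges can be reused with `pd.cut(test_data["Age"], bins=age_edges, labels=False)`, so that both datasets share identical groups and weights.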
Since this approach worked so well for the ages, I decided to apply it to the other continuous variable of this dataset: the ticket fares.
Using the same exploratory code, I tinkered with the number of groups that would be ideal for this variable:
As you can see, if I divide this feature into too many groups, there is no information in some of them. I therefore decided to go with 3 of them.
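The empty-group problem is easy to reproduce on a handful of made-up fares. This sketch assumes equal-width groups from `pd.cut`, as in the age code; because a few fares are far higher than the rest, most of the range contains no passengers at all:

```python
import numpy as np
import pandas as pd

# Made-up fares with the same shape as the real ones: many cheap tickets, a few very expensive ones.
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 71.3, 263.0, 512.3], name="Fare")

counts_by_n = {}
for n_groups in (3, 6):
    # labels=False returns the group number (0, 1, 2, ...) of each fare.
    codes = pd.cut(fares, bins=n_groups, labels=False).to_numpy()
    # Count passengers per group, keeping the empty groups in the result.
    counts_by_n[n_groups] = np.bincount(codes, minlength=n_groups).tolist()

for n_groups, counts in counts_by_n.items():
    print(n_groups, "groups:", counts)
```

With 3 groups every group contains at least one passenger, while with 6 groups half of them are empty.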
This is a feature that I haven’t used yet: the port where the passengers embarked. Passengers on the Titanic embarked in 3 different ports: Cherbourg (C), Queenstown (Q) and Southampton (S). That information is available for almost all passengers except for two passengers in the training dataset:
A quick Google search told me that both women embarked in Southampton. I therefore filled that information into my dataset.
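Filling the gaps could look like the sketch below (made-up rows; only the “Embarked” column name and the “S” code for Southampton come from the dataset):

```python
import pandas as pd

# Made-up rows with two missing ports, mimicking the two passengers above.
train_data = pd.DataFrame({"Embarked": ["S", None, "C", "Q", None]})

missing_before = int(train_data["Embarked"].isna().sum())

# Both passengers embarked in Southampton, so the gaps are filled with "S".
train_data["Embarked"] = train_data["Embarked"].fillna("S")

print(missing_before, "missing values filled")
```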
Intuitively, I don’t think it would influence the survival rate. However, let’s plot it to make sure:
Now, that is surprising: there seems to be a strong correlation between survival and the port of embarkation! It looks like I need to add this feature to my model.
Indeed, when I calculate the survival rate for each port, this is what I get:
I therefore decided to map each port to its survival rate.
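A possible way to do this mapping, sketched on made-up rows (the rates below are not the real Titanic figures, and `EmbarkedRate` is a hypothetical column name). The rates are computed on the training data and then reused as-is for the test data, which has no “Survived” column:

```python
import pandas as pd

# Made-up training rows: the resulting rates are illustrative only.
train_data = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Survived": [0, 0, 1, 0, 1, 1, 0, 1],
})
test_data = pd.DataFrame({"Embarked": ["C", "S", "Q"]})

# Survival rate per port, from the training data only.
port_rate = train_data.groupby("Embarked")["Survived"].mean()

# Replace each port letter by its survival rate in both datasets.
train_data["EmbarkedRate"] = train_data["Embarked"].map(port_rate)
test_data["EmbarkedRate"] = test_data["Embarked"].map(port_rate)

print(port_rate)
```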
In my previous post, I used the model called “Random Forest Classifier”. This time, I will test other models and see if I get a better accuracy.
Gradient Boosting Classifier
The whole idea of “boosting” in machine learning is to deal with something called a “weak learner”. A weak learner is an algorithm or hypothesis that gives results that are a bit better than chance, but not much better.
“Hypothesis boosting was the idea of filtering observations, leaving those observations that the weak learner can handle and focusing on developing new weak learns to handle the remaining difficult observations.”
Source: A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning
Weak learners are typically decision trees that don’t perform well enough on their own. So the idea here is to add decision trees one at a time: each new tree is trained to correct the errors of the trees already in the model, and “gradient descent” is used to minimise the loss (a measure of the prediction error) at each step.
The difference with a Random Forest is that here the trees are built one after another, each one depending on the previous ones, while in a Random Forest the trees are built independently and their votes are combined at the end.
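To make the “one tree at a time” idea concrete, here is a hand-rolled sketch of the boosting loop on synthetic data. It is a simplified illustration and not how scikit-learn's GradientBoostingClassifier is implemented internally: the data, the learning rate of 0.3, the 20 rounds and the use of one-split trees are all arbitrary choices for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 200 samples, 3 features, label depends on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.3
raw_score = np.zeros(len(y))  # a raw score of 0 means a predicted probability of 0.5
losses = []

for _ in range(20):
    prob = sigmoid(raw_score)
    # Log loss of the current model, recorded before adding the next tree.
    losses.append(-np.mean(y * np.log(prob) + (1 - y) * np.log(1 - prob)))
    # Negative gradient of the log loss with respect to the raw score.
    residual = y - prob
    # Weak learner: a tree with a single split, fitted to the current errors.
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    # Small step in the direction suggested by the new tree.
    raw_score += learning_rate * tree.predict(X)

accuracy = np.mean((sigmoid(raw_score) > 0.5) == (y == 1))
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}, training accuracy: {accuracy:.3f}")
```

Each tree on its own is a weak learner, but because every new tree is fitted to what the previous ones got wrong, the loss goes down round after round.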
XGBoost
XGBoost stands for eXtreme Gradient Boosting. It’s a type of Gradient Boosting that is designed to improve speed and efficiency.
It is very popular in machine learning competitions.
Support Vector Classifier
The idea of this algorithm is very different from the previous ones. Its logic is similar to logistic regression, in that it looks for a boundary between the categories, and it still works with a large number of features. The goal is to find what is called a “hyperplane” that separates the data into categories (here: survived and died) with the widest possible margin. If there are n features, the data lives in an n-dimensional space and the hyperplane has n − 1 dimensions.
On a small dataset like this one, it is quite a simple and fast algorithm compared to ensembles of decision trees, but it’s still very effective.
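A comparison between these kinds of model can be sketched with scikit-learn's cross-validation helper. The data below is synthetic (generated with `make_classification`), the hyperparameters are defaults chosen for the example, and the scores it prints say nothing about the Titanic results. XGBoost is left out so that the snippet only depends on scikit-learn; if the xgboost package is installed, its `XGBClassifier` follows the same interface and can be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the engineered Titanic features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    # SVC is sensitive to feature scales, so the features are standardised first.
    "Support Vector Classifier": make_pipeline(StandardScaler(), SVC()),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Cross-validation scores are only an estimate: the accuracy reported by Kaggle on the hidden test set can differ from them.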
And the Winner Is…
The best model was XGBoost, which gave me an accuracy of 79.425%!
I haven’t reached the 80% accuracy that I wanted but I am really close to it. I’ve decided to leave it at that for now, and start working on other datasets and other courses.
I have learned A LOT working on this Titanic competition. Machine Learning is fascinating and I could spend years on this dataset alone, fine tuning and improving the features/parameters/models. I cannot wait to apply my newfound skills to other fields and areas of data science.
- My Notebook on Kaggle with the whole code
- A Simple and effective approach to ML, by Sid2412
- A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning
- Decision Tree vs Random Forest vs Gradient Boosting Machines: Explained Simply
- A Gentle Introduction to XGBoost for Applied Machine Learning
- Support Vector Machine — Introduction to Machine Learning Algorithms