Today, we are brushing up on our history with one of the most famous maritime disasters of all time: the sinking of the Titanic.
This competition is the beginners’ competition on Kaggle. Its aim is to help you familiarise yourself with the concept of Machine Learning in the context of data analysis. The competition is open to anyone, but it is recommended to have some notions of Python and basic ML to perform better.
What Is This About?
If you need a little historical catch-up, here’s what Wikipedia has to say about the disaster:
RMS Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912, after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking one of modern history’s deadliest peacetime commercial marine disasters.
You might remember it better if you listen to this:
The idea of this competition is to predict who survived the iceberg and who didn’t, given a certain number of parameters. There are several approaches to this, but the competition is specifically about machine learning, so that is what I’ll be using.
What Is Machine Learning?
Machine Learning (ML) algorithms are computer programmes that analyse data and find patterns in order to make predictions and/or decisions. It is a type of artificial intelligence that is specifically applied to data (especially large samples of data).
To give you a simple example: let’s say you want to predict how popular a baby name will be next year. You can use a ML algorithm that will take the information about baby names from the previous years and predict how well a specific name will do next year from that information.
This project is actually not my first time using ML. During my master’s thesis, I used artificial neural networks (a type of model used in ML) to predict the energy of particles found in a detector. Today, I will use a decision tree model, specifically a Random Forest Classifier (more on that later).
In all ML models, you need a sample of data to build your mathematical model. This is known as “training data”. For this data, you know everything. If I take my previous example, this would be the data on baby names from the previous years. You “feed” this data into your algorithm and teach it to recognise patterns that will help predict the outcome. Then, once you are sure your model works, you apply it to your “testing data”, aka the data for which you’d like to predict the outcome.
So, if we go back to the Titanic problem: Kaggle decided to divide the passengers into two groups. For the first group (= training data), we know everything: name, class, age and whether they survived or not. For the second group (= test data), we have all the information EXCEPT if they survived. The challenge is to predict whether the passengers from the second group survived or not.
Exploring the Data
The Spreadsheet
The first thing I did was to take a look at the data. The file containing the training data looks like this:

Each row represents a passenger, for which we have various information:
- PassengerId: the ID number given for each passenger in this dataset. You can see it as the row number.
- Survived: 0 if they died, 1 if they survived.
- Pclass: whether they were in 1st, 2nd or 3rd class.
- Name: full name including title and sometimes maiden name.
- Sex: male or female
- Age: in years
- SibSp: number of siblings and/or spouses aboard the Titanic
- Parch: number of parents and/or children aboard the Titanic
- Ticket: the ticket number
- Fare: how much they paid for their ticket
- Cabin: the cabin number
- Embarked: which port they embarked from. C = Cherbourg, Q = Queenstown, S = Southampton.
The testing data’s layout is exactly the same, without the “Survived” column.
There are 891 rows in the training data, meaning 891 passengers. With a very quick describe() function, I can get some statistical information about this data:

Immediately, we notice something interesting in the “count” row: the number for Age is lower, which means that, for some passengers, no age is present in the table. Note that describe() only takes into account the columns that contain numbers, not strings (which makes sense), so we have blanks in other columns too (especially in the “Cabin” column).
I will need to take this into account, as missing information can skew the model quite dramatically.
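For reference, here’s roughly what that first look can be in pandas. This is a minimal sketch, assuming the competition files train.csv and test.csv have been downloaded into the working directory (the paths on Kaggle itself differ):
import pandas as pd

# Load the two samples provided by Kaggle
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

# Statistical summary of the numerical columns
print(train_data.describe())

# Count the missing values per column (Age and Cabin have many blanks)
print(train_data.isnull().sum())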
Best Feature: Sex
A great way to understand the data better and find some patterns is to plot various pieces of information. I used the Matplotlib library for all of the following graphs.
The first information (= feature) I decided to plot was the difference between men and women in terms of survival:

But since there are more men than women in this sample, I need to use relative values to compare them better.

As you can see, women were almost three times more likely to survive the Titanic than men. This correlation is so strong that if you used this parameter alone to predict who dies and who survives in the test sample, you would be 76.55% correct! The goal is to be more accurate than this simple prediction, and therefore to increase this number.
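As an illustration, here’s a minimal sketch of how the relative survival rates by sex can be plotted, and how a naive “women survive, men don’t” rule can be checked on the training sample (assuming the train_data DataFrame from the sketch above; the 76.55% quoted above is the score that rule obtains on Kaggle’s test sample):
import matplotlib.pyplot as plt

# Survival rate per sex (relative values rather than raw counts)
train_data.groupby("Sex")["Survived"].mean().plot(kind="bar")
plt.ylabel("Survival rate")
plt.show()

# Accuracy of the naive "women survive, men don't" rule on the training sample
baseline = (train_data["Survived"] == (train_data["Sex"] == "female")).mean()
print(f"Baseline accuracy: {baseline:.2%}")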
More Features
Sex, however, is not the only important piece of data in this sample. Class also plays a role: people in first class had a better chance of surviving than those in second and third class. Here it is plotted (relative to the number of passengers in each class):

Next, I decided to explore two variables that puzzled me when I saw them: SibSp and Parch (number of siblings/spouses and parents/children aboard). Here are the plots for those two parameters:


Both seem to influence the survival rate more than I had anticipated. According to these plots, you had a better chance of surviving if you had 1 or 2 siblings/spouses, and 1 to 3 parents/children aboard the Titanic.
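If it helps, the relative plots above can be reproduced along these lines (again a sketch, assuming train_data as before; my actual plots may have been styled differently):
import matplotlib.pyplot as plt

# Survival rate relative to the number of passengers in each group
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, feature in zip(axes, ["Pclass", "SibSp", "Parch"]):
    train_data.groupby(feature)["Survived"].mean().plot(kind="bar", ax=ax)
    ax.set_title(feature)
    ax.set_ylabel("Survival rate")
plt.tight_layout()
plt.show()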
First Attempt at Modelling
Now that I have four basic features that I know will influence the survival outcome for the passengers, it’s time to model and predict. Note: I followed this tutorial for this first attempt.
As said earlier, I used a Random Forest Classifier model. This model is a collection of something called “decision trees”.
Here’s a very simple way to understand what a decision tree is (from AI Time Journal):
The decision tree algorithm works like a human brain: every time we make a decision, we first ask ourselves a question. For example: are there discount offers going on in online shopping? If yes, then I will buy the products; otherwise I will not.

A Random Forest Classifier is a group of such trees; using more trees usually makes it work better.
So, I used a Random Forest Classifier from the Sklearn library to work on my data. Practically, it looks like this:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch"]
# One-hot encode the categorical features (e.g. Sex) for both samples
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# Train a forest of 100 trees and predict the outcome for the test sample
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
y_test = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': y_test})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")
I “fed” it with the four basic features, and asked it to record an outcome for the test sample. The outcome would be very simple: 1 if the model predicted survival, and 0 if the model predicted death.
The output document looked like this:

I submitted the output document on Kaggle, and got my score: 77.51%! While this is better than the simple male/female decision, it is still far from being accurate.
The goal of this project is to increase this number as much as possible; the closer you are to 100%, the higher you will rank in the competition. With this simple analysis, I’m only ranked 10672 out of 18923 (in the top 57%). In other words: could do better!
Improving My Rank: Feature Engineering
So one of the ways that I can improve the model’s predictive score is to add more features and tweak them to be more relevant. I’m going to attempt a few things in this section and see what the effect is on my score.
Fare
In addition to the class of the tickets, we have some information about the price of the tickets for each passenger. Let’s do some plotting to see if it has any influence on the survival rate.

As you can see, ticket fares seem to influence the outcome too, as you would expect (since we already determined that ticket class influences the outcome, and first-class tickets were more expensive than third-class ones). However, this trend also appears within a single class. Here is the same plot, but for second class only:

As you can see, people who had more expensive tickets within this class had a better chance of surviving. The trend is a little less clear in first and third class, but it seems to me that “Fare” is a feature worth adding to the Random Forest Classifier.
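Here is a rough sketch of the kind of comparison behind these plots, restricted to second class (assuming train_data as before; the number of bins is arbitrary):
import matplotlib.pyplot as plt

# Compare fare distributions of survivors and non-survivors in second class
second_class = train_data[train_data["Pclass"] == 2]
fares_survived = second_class.loc[second_class["Survived"] == 1, "Fare"]
fares_died = second_class.loc[second_class["Survived"] == 0, "Fare"]
plt.hist([fares_survived, fares_died], bins=10, label=["Survived", "Died"])
plt.xlabel("Fare")
plt.ylabel("Number of passengers")
plt.legend()
plt.show()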
This is where the “engineering” part starts: there is one missing value for “Fare” in the test data:

It turns out that the algorithm is not really happy with a missing value and doesn’t know what to do with it. To remedy this, I replace the missing value with the median of the fare prices:
test_data["Fare"] = test_data["Fare"].fillna(test_data["Fare"].median())
Result: 77.99%! My rank is now 6998. Progress!
Title
You might not think that the names of the passengers are important, but they are. I came across this tutorial about basic feature engineering, which explains how to extract the title of the person from the name field in the dataset. And boy, there are a lot of them in the training data!
{'Col', 'the Countess', 'Master', 'Ms', 'Mrs', 'Major', 'Sir', 'Jonkheer', 'Rev', 'Lady', 'Miss', 'Mlle', 'Dr', 'Don', 'Mme', 'Mr', 'Capt'}
Similarly, in the testing data:
{'Mr', 'Col', 'Rev', 'Master', 'Dr', 'Ms', 'Miss', 'Dona', 'Mrs'}
It’s interesting to me that some titles are in other languages, such as French, Spanish or Dutch. This will matter when it comes to classifying them. For example, I learned from the tutorial that “Jonkheer” is an honorific for men of the Dutch nobility. Clearly, it has to be treated differently from the simple “Mr”.
I decided to divide these into 5 categories: Miss, Mrs, Mr, Noble, Crew. If I plot the survival against these 5 categories:

As you can see, there is a strong correlation between the titles and the survival rate. In particular, if your title was “Mr” and you were not noble, you had a much lower chance of surviving. This refines the “Sex” feature quite nicely.
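For context, here is one hedged way such a feature can be built. The regular expression and the grouping of the raw titles into my five categories are illustrative; the exact mapping I used may differ:
# Extract the title from names such as "Braund, Mr. Owen Harris"
for df in (train_data, test_data):
    df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Group the raw titles into 5 broad categories (an illustrative mapping;
# note this regex extracts "the Countess" as "Countess")
title_map = {
    "Mr": "Mr", "Mrs": "Mrs", "Mme": "Mrs",
    "Miss": "Miss", "Mlle": "Miss", "Ms": "Miss",
    "Sir": "Noble", "Lady": "Noble", "Countess": "Noble",
    "Jonkheer": "Noble", "Don": "Noble", "Dona": "Noble",
    "Capt": "Crew", "Col": "Crew", "Major": "Crew", "Rev": "Crew", "Dr": "Crew",
}
# Titles not listed here (e.g. "Master") still need to be assigned to one of
# the five groups; where they belong is a judgment call I leave open here
for df in (train_data, test_data):
    df["Title"] = df["Title"].map(title_map)
The new “Title” column can then be appended to the features list and one-hot encoded with get_dummies like the others.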
I fed this new feature into the model, and my accuracy jumped to 78.95%!
Age
The next parameter I decided to study was the age of the passengers. As usual, I started by plotting to see if there was any correlation with survival. I grouped ages by decade to make the plot clearer:
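For reference, a minimal sketch of that decade grouping, assuming train_data as before (passengers with no recorded age are simply dropped from the plot):
import matplotlib.pyplot as plt

# Bin ages into decades and plot the survival rate per bin
with_age = train_data.dropna(subset=["Age"]).copy()
with_age["Decade"] = (with_age["Age"] // 10 * 10).astype(int)
with_age.groupby("Decade")["Survived"].mean().plot(kind="bar")
plt.ylabel("Survival rate")
plt.show()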

The only area where age seems to make a difference is between 0 and 10, aka the children. I decided to check if that behaviour was the same for men and women:


The results are interesting: for women, age doesn’t seem to make a big difference. However, for men, the behaviour of the histogram changes for boys under 10: they had a better chance of survival than males in any other age category.
In other words, this perfectly represents the “women and children first” attitude of the time.
The phrase was popularised by its usage on the RMS Titanic. The Second Officer suggested to Captain Smith, “Hadn’t we better get the women and children into the boats, sir?”, to which the captain responded: “put the women and children in and lower away”. The First and Second officers (Murdoch and Lightoller) interpreted the evacuation order differently; Murdoch took it to mean women and children first, while Lightoller took it to mean women and children only. Second Officer Lightoller lowered lifeboats with empty seats if there were no women and children waiting to board, while First Officer Murdoch allowed a limited number of men to board if all the nearby women and children had embarked. As a consequence, 74% of the women and 52% of the children on board were saved, but only 20% of the men.
Wikipedia – Women and Children First
It would therefore make sense to add the “Age” feature to my model. However, I quickly ran into the same issue as with the “Fare” feature: some ages are missing. In the training data, we are missing 177 ages, and in the testing data we are missing 86. These numbers are significant enough to create a problem if I simply use a single overall value for all of these missing ages, the same way I did for that one missing fare.
I came across this tutorial, which suggests calculating the median based on certain criteria, for example the sex, the title and the class. So I calculated this for all the missing ages and fed the results to my model.
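A hedged sketch of that imputation, assuming the “Title” column built earlier (the grouping criteria follow the tutorial’s example; the exact ones I used may differ slightly):
# Fill missing ages with the median age of passengers who share the same
# sex, title and class (falling back to the overall median when a group
# is empty or the title itself is missing)
for df in (train_data, test_data):
    group_median = df.groupby(["Sex", "Title", "Pclass"])["Age"].transform("median")
    df["Age"] = df["Age"].fillna(group_median).fillna(df["Age"].median())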
Unfortunately, the score obtained when adding the age information was 77.99%, lower than when that information is left out!
It seems to me that the age “dilutes” the findings, which could be due to how many of the missing values we had to guess. The median might not be the best way to estimate the missing ages.
What Now?
I tried to add/remove a few of these features, but I can’t seem to go past the 78.95% accuracy.
Since I was stuck, I started a discussion on Kaggle about my problem and got a lot of advice about where I could go from here. I will have to test and compare these solutions and see what works best for me. I also intend to continue the free courses on Kaggle to increase my knowledge of Machine Learning, particularly topics like Deep Learning and TensorFlow.
Since this article is already super long, I’ll write another one with my findings once I have progressed. Stay tuned!
EDIT: Here’s the second article!
Links
- All the code used for this project is on my GitHub
- The sinking of the RMS Titanic on Wikipedia
- The Titanic competition on Kaggle
- Decision tree model
- How to score 0.8134 🏅 in Titanic Kaggle Challenge
- Basic Feature Engineering with the Titanic Data
- Free Kaggle courses about Data Science