Kaggle Titanic challenge
introduction
The Kaggle Titanic competition is the ‘hello world’ exercise for data science. Its stated purpose is to “Predict survival on the Titanic using Excel, Python, R & Random Forests”.
In this post I will go over my solution, which scores 0.79426 on the Kaggle public leaderboard. The code can be found on GitHub. In short, the solution uses soft majority voting over logistic regression and random forest classifiers.
pre-processing
For this dataset, the pre-processing step includes the following operations:
- imputing missing values
- mapping nominal features
The data can be loaded with the read_csv() function.
import pandas as pd

# load data
df = pd.read_csv("train.csv")
df.info(null_counts=True)
Calling info(null_counts=True) gives an overview of the column data types and their non-null counts, as shown below.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
Several columns do not carry much information about a passenger's survival, so I drop them: the passenger id (PassengerId), the name (Name), and the ticket number (Ticket).
The Cabin column is also dropped because of its incompleteness: only 204 of the 891 entries are present.
There are missing values in the Age and Embarked columns, which I fill with the following code.
# fill in missing values
df.Age.fillna(df.Age.median(), inplace=True)
df.Embarked.fillna('S', inplace=True)

# drop columns that are not used as features
df.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1, inplace=True)

# one-hot encode the nominal features 'Sex' and 'Embarked'
df = pd.get_dummies(df)
Here the missing Age values are filled with the median age, and the missing Embarked values are filled with 'S', which occurs 644 times out of the 889 existing entries.
The two columns Embarked and Sex contain string values that are nominal categorical features. The get_dummies() function converts them into one-hot encoded vectors. For example, the Sex column is expanded into two columns, Sex_female and Sex_male, whose values are either 0 or 1.
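As a quick sanity check, the frame should now contain only numeric columns; the column list should look roughly like the comment below (the exact order can vary with the pandas version).
print(df.columns.tolist())
# expected, roughly:
# ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare',
#  'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S']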
training
My strategy is to do majority voting with multiple classifiers, which is conveniently implemented by sklearn.ensemble.VotingClassifier. The relevant code is shown below.
Individually, the logistic regression and random forest classifiers have accuracies of 0.796 and 0.832 respectively; the combined soft-voting classifier reaches an accuracy of 0.833 on the training data.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score

# use every column except the target 'Survived' as a feature
predictors = df.columns.tolist()
predictors.remove('Survived')

clf1 = LogisticRegression(C=1, random_state=1, warm_start=True)
clf2 = RandomForestClassifier(random_state=1, n_estimators=10,
                              min_samples_split=5, min_samples_leaf=2, max_features=3)

# 'soft' voting averages the predicted class probabilities of the two models
eclf = VotingClassifier([('lr', clf1), ('rf', clf2)], voting='soft')
scores = cross_val_score(eclf, df[predictors], df['Survived'], cv=15)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
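The submission itself is produced by fitting the ensemble on the full training set and predicting on the test set; the exact code is in the GitHub repo. A minimal sketch, assuming test.csv has been run through the same preprocessing into a frame df_test and its PassengerId column saved in test_ids (both names are placeholders, not from the repo), would look like this.
# df_test / test_ids: preprocessed test set and its passenger ids (assumed above)
eclf.fit(df[predictors], df['Survived'])
predictions = eclf.predict(df_test[predictors])

# Kaggle expects a CSV with exactly two columns: PassengerId and Survived
submission = pd.DataFrame({'PassengerId': test_ids, 'Survived': predictions})
submission.to_csv('submission.csv', index=False)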
I also played with more classifiers, such as SVM, naive Bayes, and k-nearest neighbors. Each of them gives an accuracy of around 0.8 on its own, but when they are added to the ensemble the performance on the training data gets slightly worse. In fact, even in the two-classifier case the ensemble does not always beat the individual classifiers. My guess is that all these classifiers do well on, say, the same 75% of the data. Since each classifier gets about 80% of the data right, the extra ~5% of correct predictions differs from method to method; if those extra correct samples do not overlap, they can never win a majority vote, which makes voting ineffective.
Let me demonstrate this explanation with the following pictures. For simplicity, assume there are three classifiers voting and each gets 80% of the data right when used individually.
In the pictures, the horizontal direction denotes the data and the three horizontal bars denote the three classifiers. The blue regions mark the part of the data where a classifier gives the correct result. Ideally, we want something like the picture below to happen, in which case majority voting achieves 100% accuracy.
In the worst case, the three classifiers are correct on the same 70% of the data and the remaining three 10% slices do not overlap. Then the ensemble only reaches 70% accuracy, which is worse than any individual classifier!
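The same argument can be checked numerically. The toy sketch below (purely illustrative, not using the Titanic data) builds three classifiers that are each 80% accurate and shows that majority voting gives anywhere between 100% and 70% accuracy depending on how their correct predictions overlap.
import numpy as np

n = 1000  # toy data set size

def majority_accuracy(correct):
    # correct: (3, n) boolean array, True where a classifier is right
    return np.mean(correct.sum(axis=0) >= 2)

# best case: the 20% errors of the three classifiers never overlap,
# so every sample still gets at least two correct votes
best = np.ones((3, n), dtype=bool)
best[0, :200] = False
best[1, 200:400] = False
best[2, 400:600] = False
print(majority_accuracy(best))   # 1.0

# worst case: all three are right on the same 70%, and their extra 10% of
# correct answers never overlap, so those samples get only one vote each
worst = np.zeros((3, n), dtype=bool)
worst[:, :700] = True
worst[0, 700:800] = True
worst[1, 800:900] = True
worst[2, 900:] = True
print(majority_accuracy(worst))  # 0.7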
It is not clear to me how to make the first case, rather than the second, happen for our classifiers. Please drop me a comment if you have insights on this.
feature engineering
Up to this point, my score on the Kaggle public leaderboard is 0.78469. To get a better result, I tried several tricks from various blogs.
The one that worked for me is adding a title feature. The idea is that the survival rate may depend on how ‘important’ the person is.
# group the titles found in the Name column into a few broader categories
title_map = {'Ms': 'Mrs', 'Mlle': 'Mrs', 'Mme': 'Mrs', 'Capt': 'Sir',
             'Dr': 'Sir', 'Rev': 'Sir', 'Major': 'Sir', 'Col': 'Sir', 'Don': 'Sir',
             'the Countess': 'Lady', 'Jonkheer': 'Lady', 'Dona': 'Lady'}
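The extraction of the raw title is not shown above; a minimal sketch of how it could be done (my reading, run before the Name column is dropped): names follow the pattern "Last, Title. First", so the title can be split out and then collapsed with the map above.
# run before Name is dropped: names look like "Braund, Mr. Owen Harris"
df['Title'] = df['Name'].str.split(',').str[1].str.split('.').str[0].str.strip()
df['Title'] = df['Title'].replace(title_map)   # collapse rare titles
df = pd.get_dummies(df, columns=['Title'])     # one-hot encode like Sex/Embarked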
After adding the title feature, my score was boosted to 0.79426.
what next?
According to the Kaggle forums, a submission with 82-84% accuracy is considered very good; anything beyond 85% may require some cheating.
It would be nice to push my score beyond 80%. There are several possibilities:
- more feature engineering?
  - extract the title from the Name column
  - maybe also a family size feature
  - turn Age into a child/adult category
- fine-tuning the individual classifiers more
- changing the classifiers
- stacking (see the sketch after this list)
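On the stacking idea: newer versions of scikit-learn ship a StackingClassifier, so an untuned starting point (reusing clf1 and clf2 from above; this is only a sketch, not something I have submitted) might look like this.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# out-of-fold predictions of the base models feed a logistic regression meta-model
stack = StackingClassifier(estimators=[('lr', clf1), ('rf', clf2)],
                           final_estimator=LogisticRegression(), cv=5)
scores = cross_val_score(stack, df[predictors], df['Survived'], cv=15)
print('stacking CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))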
Please let me know if you have other suggestions on improving the score.