It took some nerve to start on Kaggle, but I am really glad I finally did after multiple false starts.

By following this walkthrough you'll be able to score at least a top-5000 position on the leaderboard.

Let’s import some libraries to get started!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')

The Data

We will be working with the Titanic data set from Kaggle, downloaded as the train.csv and test.csv files.

train = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

train.head(2)
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C

Exploratory Data Analysis

Let's do some exploratory data analysis, starting by checking out missing data.

Missing Data

We can use seaborn to create a simple heatmap to see where we are missing data!

sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

# From this assessment of the available data, Age and Cabin have missing
# values, while the rest of the columns are relatively complete.

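For a quick numeric cross-check of the heatmap (a one-liner, not in the original notebook), count the nulls per column:

train.isnull().sum()  # Age and Cabin dominate; Embarked is missing only 2 values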

Visualizing More of the Data

Analysis by column, starting with counts by survival.

sns.countplot(x='Survived',data=train)


Survival by Gender

sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')


Survival by Passenger Class

sns.countplot(x='Survived',hue='Pclass',data=train)


Distribution of Passengers on board by Age

sns.distplot(train['Age'].dropna(),kde=False,bins=30)
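Note: distplot is deprecated in newer seaborn releases; if it errors for you, the equivalent histogram-only call is sns.histplot (a drop-in sketch):

sns.histplot(train['Age'].dropna(), bins=30)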


Passengers onboard with sibling(s) / spouse

sns.countplot(x='SibSp',data=train)


Passengers by amount of fare paid

train['Fare'].hist(bins=20,figsize=(10,5))


Data Cleaning

Imputation:

  • Fill in missing values by approximation
  • Fill in missing Age values using the average age for each passenger class

Start off by checking the average age by passenger class.

plt.figure(figsize=(10,7))
sns.boxplot(x='Pclass',y='Age',data=train)


Wealthier passengers in the higher classes tend to be older.

We'll use these per-class average age values to impute the missing Age data.
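As a quick check (a one-liner, not in the original notebook), the per-class medians line up with the values hard-coded below:

train.groupby('Pclass')['Age'].median()  # roughly 37, 29 and 24 for classes 1, 2 and 3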

def impute_age(cols):
    Age = cols[0]
    Pclass = cols[1]

    if pd.isnull(Age):
        # Approximate average ages read off the boxplot above
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

Apply the impute_age function across the Age and Pclass columns:

train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
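An equivalent, more idiomatic one-liner (a sketch, not what this notebook ran) fills each class's missing ages with that class's median directly:

train['Age'] = train.groupby('Pclass')['Age'].transform(lambda s: s.fillna(s.median()))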

Checking again for missing values, we now have:

sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')


We can drop the Cabin column, as it has a huge percentage of missing values and filling it in may not be appropriate.

The Embarked column is missing only two values; we leave those rows in, and pd.get_dummies will simply encode them as zeros in every dummy column.

train.drop('Cabin',axis=1,inplace=True)

Check that the dataset has been preprocessed as intended.

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.6+ KB
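Note that Embarked still shows 889 non-null entries. If you would rather drop those two rows than leave them to the dummy encoding, one way (a sketch, not run here) is:

train.dropna(subset=['Embarked'], inplace=True)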

Convert Categorical Features

We need to convert categorical features to dummy variables using pandas; otherwise the learning algorithm won't be able to take those features directly as inputs.

For the Sex column, encode whether the passenger is male or not (1 or 0).

For the embarkation point the categories are Q, S or C; drop_first=True drops one level from each encoding so the dummy columns are not perfectly collinear.

sex = pd.get_dummies(train['Sex'],drop_first=True)

embark = pd.get_dummies(train['Embarked'],drop_first=True)
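To see what get_dummies produced (the values follow from the head output above; female and C are the dropped baseline levels):

print(sex.head(2))     # male: 1 for Mr. Braund, 0 for Mrs. Cumings
print(embark.head(2))  # (Q, S): (0, 1) for port S; (0, 0) for port C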

Concatenate the generated dummy columns onto the dataset.

train = pd.concat([train, sex,embark],axis=1)
train.head(2)
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Embarked  male  Q  S
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500        S     1  0  1
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833        C     0  0  0

Select the columns that we will use for the model.

train.drop(['Name','Sex','Embarked','Ticket'],axis=1,inplace=True)
train.drop('PassengerId',axis=1,inplace=True)
train.head(2)

   Survived  Pclass   Age  SibSp  Parch     Fare  male  Q  S
0         0       3  22.0      1      0   7.2500     1  0  1
1         1       1  38.0      1      0  71.2833     0  0  0

Prep the Test Set

Fill the missing Fare and Age values in the test set with the column means, generate the same dummy variables, and drop the unused columns, mirroring what we did to the training set.

test_df['Fare'].fillna(test_df['Fare'].mean(), inplace=True)
test_df['Age'].fillna(test_df['Age'].mean(), inplace=True)

sex_t = pd.get_dummies(test_df['Sex'],drop_first=True)
embark_t = pd.get_dummies(test_df['Embarked'],drop_first=True)
test_df = pd.concat([test_df,sex_t,embark_t],axis=1)
test_df.drop(['Name','Sex','Embarked','Ticket','Cabin'],axis=1,inplace=True)

test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 9 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Fare           418 non-null float64
male           418 non-null uint8
Q              418 non-null uint8
S              418 non-null uint8
dtypes: float64(2), int64(4), uint8(3)
memory usage: 20.9 KB

test_df.head(2)

   PassengerId  Pclass   Age  SibSp  Parch    Fare  male  Q  S
0          892       3  34.5      0      0  7.8292     1  1  0
1          893       3  47.0      1      0  7.0000     0  0  1

And the data is ready for our model!

Building a Logistic Regression model

Train Test Split

Here the split is already given by Kaggle: we train on all of train.csv and predict on test.csv.

X = the features we will use for prediction

y = the value we are predicting, i.e. did the passenger survive?

X = train.drop('Survived',axis=1)
y = train['Survived']

Prep the training and test matrices

from sklearn.model_selection import train_test_split  # optional: for a local validation split (see below)
X_train = train.drop('Survived',axis=1)
y_train = train['Survived']
X_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Pclass    891 non-null int64
Age       891 non-null float64
SibSp     891 non-null int64
Parch     891 non-null int64
Fare      891 non-null float64
male      891 non-null uint8
Q         891 non-null uint8
S         891 non-null uint8
dtypes: float64(2), int64(3), uint8(3)
memory usage: 37.5 KB
X_test  = test_df.drop('PassengerId',axis=1)
X_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
Pclass    418 non-null int64
Age       418 non-null float64
SibSp     418 non-null int64
Parch     418 non-null int64
Fare      418 non-null float64
male      418 non-null uint8
Q         418 non-null uint8
S         418 non-null uint8
dtypes: float64(2), int64(3), uint8(3)
memory usage: 17.6 KB
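Since the Kaggle test set has no labels, one way to estimate accuracy before submitting is to hold out part of the training data. A minimal sketch (the 30% split and the seed are arbitrary choices, not from the notebook):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hold out 30% of the labelled rows as a local validation set
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=101)
val_model = LogisticRegression()
val_model.fit(X_tr, y_tr)
print(accuracy_score(y_val, val_model.predict(X_val)))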

Training and Predicting

from sklearn.linear_model import LogisticRegression

Create an instance of the Logistic Regression model

logmodel = LogisticRegression()

Train the model

logmodel.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
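The repr above is from an older scikit-learn. On newer versions the default solver is lbfgs, which may emit a ConvergenceWarning on unscaled data like this; raising max_iter is a common fix (a sketch):

logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(X_train, y_train)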

Make Predictions using the model

predictions = logmodel.predict(X_test)

Generate Submission File

The Kaggle evaluation is based on the predicted Survived value for each PassengerId from test.csv.

Submission = pd.DataFrame({ 'PassengerId': test_df['PassengerId'],
                            'Survived': predictions })
Submission.to_csv("4bic_titanic_submission.csv", index=False)
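Before uploading, it is worth sanity-checking the file shape (a quick check, not in the original; Kaggle expects one row per test passenger):

print(Submission.shape)  # expect (418, 2): PassengerId and Survived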

I got an accuracy of 0.706 on the leaderboard. From here, find ways to improve it!