This week’s work comes after attending the data science workshop in Nyeri; it has taken a while to write up. The focus for this week is on:
Logistic Regression with Python
Import Libraries
Let’s import some libraries to get started!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
The Data
We will be working with the Titanic dataset from Kaggle, downloaded as the titanic_train.csv file.
train = pd.read_csv('titanic_train.csv')
train.head(2)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
Exploratory Data Analysis
Some exploratory data analysis, starting with a check for missing data.
Missing Data
We can use seaborn to create a simple heatmap to see where we are missing data!
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
# A quick assessment of the available data: Age and Cabin have missing values,
# while the rest of the columns are largely complete.
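To put numbers on those gaps, a per-column count of missing values is also handy (a minimal sketch using the standard pandas API):

```python
# Count missing values per column; Age and Cabin should dominate
train.isnull().sum().sort_values(ascending=False)
```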
Visualizing some more of the data
Analysis by column, starting with survival counts.
sns.countplot(x='Survived',data=train)
Survival by Gender
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')
Survival by Passenger Class
sns.countplot(x='Survived',hue='Pclass',data=train)
Distribution of Passengers on board by Age
sns.distplot(train['Age'].dropna(),kde=False,bins=30)
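Note that sns.distplot is deprecated in recent seaborn releases; on newer versions (0.11+, an assumption about your install) the same histogram can be drawn with histplot:

```python
# Equivalent age histogram on newer seaborn versions
sns.histplot(train['Age'].dropna(), bins=30)
```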
Passengers onboard with sibling(s) / spouse
sns.countplot(x='SibSp',data=train)
Passengers by amount of fare paid
train['Fare'].hist(bins=20,figsize=(10,5))
Data Cleaning
Imputation: filling in missing values by approximation. For the Age column, we will fill in an average age.
Start off by checking the average age per passenger class.
plt.figure(figsize=(10,7))
sns.boxplot(x='Pclass',y='Age',data=train)
Wealthier passengers in the higher classes tend to be older. We’ll use these per-class average ages to impute the missing Age values based on Pclass.
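The hard-coded ages in the function below approximate the per-class medians seen in the boxplot; they can be checked directly (a quick sketch using pandas groupby):

```python
# Median age per passenger class; roughly 37, 29 and 24 for classes 1-3
train.groupby('Pclass')['Age'].median()
```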
def impute_age(cols):
    Age = cols[0]
    Pclass = cols[1]
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age
Apply the impute_age function:
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
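As an aside, the same imputation can be done without a row-wise apply; a vectorized alternative sketch using groupby/transform (this uses the actual per-class medians rather than hard-coded values):

```python
# Fill each missing Age with the median age of that passenger's class
train['Age'] = train['Age'].fillna(
    train.groupby('Pclass')['Age'].transform('median'))
```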
Checking again for missing values on our data, we have:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
We can drop the Cabin column, since a huge percentage of its values are missing and filling them in may not be appropriate. We will also drop the few rows with missing values in the Embarked column.
train.drop('Cabin',axis=1,inplace=True)
train.head(2)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
# drop the rows with missing values in the Embarked column
train.dropna(inplace=True)
Convert Categorical Features
We need to convert categorical features to dummy variables using pandas; otherwise the learning algorithm won’t be able to take those features directly as inputs.
For the Sex column, this encodes whether the passenger is male or not (1 or 0).
For the embarkation point, the categories are Q, S and C.
sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
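Passing drop_first=True drops one dummy column per feature to avoid perfectly collinear, redundant columns. A minimal sketch of the behaviour on a toy Series:

```python
# 'C' is dropped, so Q=0 and S=0 together imply Embarked == 'C'
pd.get_dummies(pd.Series(['S', 'C', 'Q']), drop_first=True)
```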
Concatenate the generated categorical columns to the dataset
train = pd.concat([train, sex,embark],axis=1)
train.head(2)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | male | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 1 | 0 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 0 | 0 | 0 |
Select the columns we will use for the model by dropping the rest:
train.drop(['Name','Sex','Embarked','Ticket','PassengerId'],axis=1,inplace=True)
train.head(2)
| | Survived | Pclass | Age | SibSp | Parch | Fare | male | Q | S |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
And the data is ready for our model!
Building a Logistic Regression model
Start by splitting the data into a training set and a test set.
Train Test Split
X: the features we will use to predict.
y: the value we are predicting, i.e. whether the passenger survived.
X = train.drop('Survived',axis=1)
y = train['Survived']
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn versions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
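A quick sanity check on the split (a sketch; with test_size=0.3, roughly 30% of the rows should land in the test set):

```python
# Confirm the 70/30 split
print(X_train.shape, X_test.shape)
```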
Training and Predicting
from sklearn.linear_model import LogisticRegression
# create an instance of LR model
logmodel = LogisticRegression()
# train the model
logmodel.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
# predict using the model
predictions = logmodel.predict(X_test)
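Besides hard 0/1 labels, a fitted LogisticRegression also exposes class probabilities via the standard scikit-learn API; a short sketch:

```python
# Estimated probability of survival (class 1) for each test passenger
survival_probs = logmodel.predict_proba(X_test)[:, 1]
```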
Evaluate the Model
Using a classification report, we can check:
- precision
- recall
- f1-score
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

          0       0.80      0.91      0.85       163
          1       0.82      0.65      0.73       104

avg / total       0.81      0.81      0.80       267
A confusion matrix can also be used to see how many observations were classified correctly and incorrectly.
confusion_matrix(y_test,predictions)
array([[148, 15],
[ 36, 68]])
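Reading the matrix: 148 true negatives, 68 true positives, 15 false positives and 36 false negatives. Overall accuracy can be derived from it directly:

```python
# Accuracy from the confusion matrix: (TN + TP) / total
(148 + 68) / (148 + 15 + 36 + 68)  # ≈ 0.81, matching the report above
```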