Practical Deep Learning, Lesson 3, Stochastic Gradient Descent on the Titanic Dataset
Series: Fast.ai Course
In this notebook, we train two similar neural nets on the classic Titanic dataset using techniques from fastbook chapter 1 and chapter 4.
We train the first using mostly PyTorch APIs and the second with fastai APIs. A few cells output warnings. I kept those because I wanted to preserve the printouts of the models’ accuracy.
The Titanic data set can be downloaded from the link above or with:
!kaggle competitions download -c titanic
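The rest of the notebook reads files from a titanic/ directory, so the downloaded archive has to be unpacked first. Here is a minimal standard-library sketch, assuming the download is saved as titanic.zip in the working directory. The archive name and the helper extract_archive are assumptions for illustration, not something the notebook itself shows.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Unpack zip_path into dest_dir and return the sorted member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return sorted(archive.namelist())


if __name__ == "__main__":
    # Assumed archive name; adjust if the download is saved elsewhere.
    if Path("titanic.zip").exists():
        print(extract_archive("titanic.zip", "titanic"))
```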
To start, we install and import the dependencies we’ll need:
%pip install torch pandas scikit-learn fastai
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from fastai.tabular.all import *
from sklearn.preprocessing import StandardScaler
Next, we load the training data:
df = pd.read_csv('titanic/train.csv')
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
X = df[features].copy()
y = df['Survived'].copy()
X.head(5)
|   | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.9250 |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1000 |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.0500 |
Now, we define two functions to normalize and fill in holes in the data so we can train on it.
def process_training_data(X):
    X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})
    X['Age'] = X['Age'].fillna(X['Age'].median())
    X['Fare'] = X['Fare'].fillna(X['Fare'].median())
    return X

def process_test_data(X):
    X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})
    return X
X = process_training_data(X)
X.head(5)
|   | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 22.0 | 1 | 0 | 7.2500 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 1 | 26.0 | 0 | 0 | 7.9250 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 |
| 4 | 3 | 0 | 35.0 | 0 | 0 | 8.0500 |
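Note that process_test_data only maps Sex and does not fill missing values. One common approach is to compute the fill values on the training data and reuse them for the test data. The sketch below illustrates that idea with plain Python dictionaries and two hypothetical helpers, fit_medians and apply_medians. It is not the code that produced the results in this notebook, and None stands in for what pandas would represent as NaN.

```python
from statistics import median


def fit_medians(rows, columns):
    """Compute the median of each column from the training rows, skipping None."""
    return {
        col: median(r[col] for r in rows if r[col] is not None)
        for col in columns
    }


def apply_medians(rows, medians):
    """Return copies of rows with None replaced by the training medians."""
    return [
        {col: (medians[col] if col in medians and val is None else val)
         for col, val in row.items()}
        for row in rows
    ]


# Toy rows for illustration only.
train_rows = [
    {"Age": 22.0, "Fare": 7.25},
    {"Age": None, "Fare": 71.2833},
    {"Age": 26.0, "Fare": 7.925},
]
test_rows = [{"Age": None, "Fare": None}]

medians = fit_medians(train_rows, ["Age", "Fare"])
filled_test = apply_medians(test_rows, medians)
print(filled_test)
```

The point of the design is that the test rows never contribute to the statistics used to fill them.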
We need to scale the numeric values to be between 0 and 1, otherwise we’ll get
RuntimeError: all elements of input should be between 0 and 1
We’ll do this with StandardScaler for both the training and test data, per Sonnet’s recommendation. StandardScaler doesn’t actually constrain the data between 0 and 1 (it standardizes each feature to zero mean and unit variance), but it seems to get the job done for the needs of the model architecture I selected.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
test_df = pd.read_csv('titanic/test.csv')
X_test = test_df[features].copy()
X_test = process_test_data(X_test)
X_test_scaled = scaler.transform(X_test)
y_test_df = pd.read_csv('titanic/gender_submission.csv')
y_test = y_test_df['Survived']
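To make concrete why standardized values are not confined to the range 0 to 1, here is a small dependency-free sketch of the per-feature computation, z = (x - mean) / std, using the population standard deviation. The helper standardize is hypothetical and the input is just the five Fare values from the table above, so this is an illustration of the idea rather than a replacement for StandardScaler.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# The five Fare values shown in the first table of this notebook.
fares = [7.25, 71.2833, 7.925, 53.1, 8.05]
scaled = standardize(fares)

print([round(z, 3) for z in scaled])
# Standardized values can be negative and can exceed 1.
print(min(scaled) < 0, max(scaled) > 1)
```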
Turn these numpy arrays into PyTorch tensors and define the model architecture.
X_train_tensor = torch.FloatTensor(X_scaled)
y_train_tensor = torch.FloatTensor(y.values)
X_test_tensor = torch.FloatTensor(X_test_scaled)
y_test_tensor = torch.FloatTensor(y_test.values)
model = nn.Sequential(
    nn.Linear(6, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()
)
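As a rough picture of what this architecture computes, the sketch below runs the same sequence (linear, ReLU, linear, sigmoid) on one input row in plain Python and counts the trainable parameters. The random weights and the init_layer helper are arbitrary stand-ins for illustration. They do not reproduce PyTorch's initialization or the trained model.

```python
import math
import random


def linear(x, weights, bias):
    """y[j] = sum_i x[i] * weights[j][i] + bias[j], as in a fully connected layer."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases; an arbitrary scheme for illustration."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


rng = random.Random(0)
w1, b1 = init_layer(6, 8, rng)   # corresponds to nn.Linear(6, 8)
w2, b2 = init_layer(8, 1, rng)   # corresponds to nn.Linear(8, 1)


def forward(x):
    hidden = relu(linear(x, w1, b1))
    return sigmoid(linear(hidden, w2, b2)[0])


# Each linear layer has n_in * n_out weights plus n_out biases.
n_params = (6 * 8 + 8) + (8 * 1 + 1)
print(n_params)
print(forward([0.5, -1.0, 0.2, 0.0, 0.0, 1.3]))
```

Because the last step is a sigmoid, the output is always strictly between 0 and 1 for finite inputs and can be read as a survival probability.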
Also, define a loss function and an optimizer:
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
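Binary cross-entropy takes the logarithm of the predicted probability, so the value passed to the loss (the model output) has to be a valid probability. That is the role of the final nn.Sigmoid layer. Below is a simplified single-example sketch of the formula, loss = -(y * log(p) + (1 - y) * log(1 - p)). The helper binary_cross_entropy is hypothetical and stricter than nn.BCELoss, which also averages over a batch.

```python
import math


def binary_cross_entropy(p, y):
    """Loss for one predicted probability p in (0, 1) and one label y in {0, 1}."""
    if not 0.0 < p < 1.0:
        # Also triggers for NaN, since comparisons with NaN are False.
        raise ValueError("p must be strictly between 0 and 1")
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


print(binary_cross_entropy(0.9, 1))  # confident and right: small loss
print(binary_cross_entropy(0.9, 0))  # confident and wrong: large loss
```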
Finally, we can train the model. Sonnet wrote this code.
num_epochs = 1000
batch_size = 64
for epoch in range(num_epochs):
    for i in range(0, len(X_train_tensor), batch_size):
        batch_X = X_train_tensor[i:i+batch_size]
        batch_y = y_train_tensor[i:i+batch_size]

        outputs = model(batch_X)
        loss = criterion(outputs, batch_y.unsqueeze(1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
Epoch [100/1000], Loss: 0.3562
Epoch [200/1000], Loss: 0.3216
Epoch [300/1000], Loss: 0.3113
Epoch [400/1000], Loss: 0.3065
Epoch [500/1000], Loss: 0.3038
Epoch [600/1000], Loss: 0.3024
Epoch [700/1000], Loss: 0.2996
Epoch [800/1000], Loss: 0.2975
Epoch [900/1000], Loss: 0.2955
Epoch [1000/1000], Loss: 0.2937
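The loop above slices the data into mini-batches, computes a loss on each, and nudges the parameters against the gradient. The printed loss is that of the last mini-batch in the epoch, not an average over the epoch. To show the same pattern without any framework, here is a toy sketch that fits a single weight w in y = w * x with mini-batch SGD on mean squared error. The function sgd_fit, the data, and the hyperparameters are made up for illustration.

```python
def sgd_fit(xs, ys, lr=0.01, num_epochs=100, batch_size=2):
    """Fit y ~ w * x with mini-batch SGD on mean squared error."""
    w = 0.0
    for _ in range(num_epochs):
        for i in range(0, len(xs), batch_size):
            batch_x = xs[i:i + batch_size]
            batch_y = ys[i:i + batch_size]
            # Gradient of mean((w * x - y) ** 2) with respect to w on this batch
            # (the role played by loss.backward() in the PyTorch loop).
            grad = sum(2 * (w * x - y) * x
                       for x, y in zip(batch_x, batch_y)) / len(batch_x)
            # Plain SGD update (the role played by optimizer.step()).
            w -= lr * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x for x in xs]  # the true weight is 2
w = sgd_fit(xs, ys)
print(round(w, 4))
```

Note that the final slice can be shorter than batch_size, which is why the gradient is divided by len(batch_x) rather than by batch_size.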
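With the model trained, we can run inference on the test set and compare the results to the “Survived” column in gender_submission.csv. Note that gender_submission.csv is Kaggle’s sample submission, which predicts survival from sex alone, not the true test labels, so the accuracy below measures agreement with that baseline.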
model.eval()
with torch.no_grad():
    y_pred = model(X_test_tensor)
    y_pred_class = (y_pred > 0.5).float()
correct_predictions = (y_pred_class == y_test_tensor.unsqueeze(1)).sum().item()
total_predictions = len(y_test_tensor)
acc = correct_predictions / total_predictions
print(f"Correct predictions: {correct_predictions} out of {total_predictions}")
print(f"Accuracy: {acc:.2%}")
Correct predictions: 368 out of 418
Accuracy: 88.04%
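The accuracy calculation thresholds each predicted probability at 0.5 and counts how many resulting classes match the labels. Here is a small sketch with made-up probabilities and a hypothetical accuracy helper. The NaN entry is included to show one thing worth checking: a comparison with NaN is False, so a NaN prediction (for example from a row that still has a missing feature) silently becomes class 0 instead of raising an error.

```python
def accuracy(probs, labels, threshold=0.5):
    """Fraction of predictions whose thresholded class matches the label."""
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Made-up values; the last probability is NaN on purpose.
probs = [0.91, 0.12, 0.55, float("nan")]
labels = [1, 0, 0, 1]
print(accuracy(probs, labels))
```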
Now, let’s build what I think is a similar model with fastai primitives.
Load the data again to avoid any unintentional contamination.
train_df = pd.read_csv('titanic/train.csv')
test_df = pd.read_csv('titanic/test.csv')
The TabularDataLoaders class from fastai needs the following configuration to create DataLoaders:
cat_names: the names of the categorical variables
cont_names: the names of the continuous variables
y_names: the names of the dependent variables
cat_names = ['Pclass', 'Sex']
cont_names = ['Age', 'SibSp', 'Parch', 'Fare']
dep_var = 'Survived'
Following a pattern similar to the one used in chapter 1, we train the model:
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_df(
    train_df,
    path='.',
    procs=procs,
    cat_names=cat_names,
    cont_names=cont_names,
    y_names=dep_var,
    valid_pct=0.2,
    seed=42,
    bs=64,
)
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(5, 1e-2)
/Users/danielcorin/dev/lab/fastbook_projects/sgd_titanic/.venv/lib/python3.12/site-packages/fastai/tabular/core.py:314: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
to[n].fillna(self.na_dict[n], inplace=True)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.486258 | 0.233690 | 0.662921 | 00:02 |
1 | 0.378460 | 0.192642 | 0.662921 | 00:00 |
2 | 0.294309 | 0.132269 | 0.662921 | 00:00 |
3 | 0.248516 | 0.140377 | 0.662921 | 00:00 |
4 | 0.220335 | 0.132353 | 0.662921 | 00:00 |
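The procs list names three preprocessing steps. As a rough idea of what a Categorify-style step does for columns such as Sex and Pclass, the sketch below builds a mapping from category values to integer codes on the training values and reuses it for new values. This is not fastai's implementation. The helpers fit_vocab and encode, the "#na#" token, and the choice of code 0 for missing or unseen values are assumptions made for illustration.

```python
def fit_vocab(values, na_token="#na#"):
    """Build a category -> integer code mapping, reserving 0 for missing/unseen."""
    vocab = [na_token] + sorted({v for v in values if v is not None})
    return {cat: idx for idx, cat in enumerate(vocab)}


def encode(values, vocab, na_token="#na#"):
    """Map each value to its code; None or unseen values fall back to code 0."""
    return [vocab.get(v, vocab[na_token]) for v in values]


vocab = fit_vocab(["male", "female", "female", "male"])
codes = encode(["female", "male", None, "unknown"], vocab)
print(vocab)
print(codes)
```

The integer codes can then be used to look up a learned embedding for each category, which is how the categorical columns differ from the continuous ones in cont_names.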
learn.dls.test_dl does not apply FillMissing to the ‘Fare’ column of the test data, likely because the training data has no missing Fare values for FillMissing to learn a fill value from, so we do that manually here.
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())
We run the test set through the model, then compare the results to the “Survived” labels in gender_submission.csv and calculate the model accuracy.
test_dl = learn.dls.test_dl(test_df)
preds, _ = learn.get_preds(dl=test_dl)
binary_preds = (preds > 0.5).float()
y_test = pd.read_csv('titanic/gender_submission.csv')
correct_predictions = (binary_preds.numpy().flatten() == y_test['Survived']).sum()
total_predictions = len(y_test)
acc = correct_predictions / total_predictions
print(f"Correct predictions: {correct_predictions} out of {total_predictions}")
print(f"Accuracy: {acc:.2%}")
/Users/danielcorin/dev/lab/fastbook_projects/sgd_titanic/.venv/lib/python3.12/site-packages/fastai/tabular/core.py:314: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
to[n].fillna(self.na_dict[n], inplace=True)
Correct predictions: 377 out of 418
Accuracy: 90.19%
The accuracies of the two models are about the same! For a first pass at training neural networks (with plenty of help from Sonnet), I think this went pretty well. If you know things about deep learning, let me know if I made any major mistakes. It’s a bit tough to know if you’re doing things correctly in isolation. I suppose that’s why Kaggle competitions can be useful for learning.