Practical Deep Learning, Lesson 3, Stochastic Gradient Descent on the Titanic Dataset

[TIL] October 18, 2024

course.fast.ai

In this notebook, we train two similar neural nets on the classic Titanic dataset using techniques from fastbook chapter 1 and chapter 4.

We train the first using mostly PyTorch APIs and the second with fastai APIs. A few cells output warnings. I kept those because I wanted to preserve the printouts of the models’ accuracy.

The Titanic dataset can be downloaded from the link above or with:

!kaggle competitions download -c titanic

To start, we install and import the dependencies we’ll need:

%pip install torch pandas scikit-learn fastai

import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim

from fastai.tabular.all import *
from sklearn.preprocessing import StandardScaler

Next, we load the training data and select the feature columns:

df = pd.read_csv('titanic/train.csv')

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
X = df[features].copy()
y = df['Survived'].copy()
X.head(5)

	Pclass	Sex	Age	SibSp	Parch	Fare
0	3	male	22.0	1	0	7.2500
1	1	female	38.0	1	0	71.2833
2	3	female	26.0	0	0	7.9250
3	1	female	35.0	1	0	53.1000
4	3	male	35.0	0	0	8.0500

Now, we define two functions to prepare the data so we can train on it. Both encode Sex as 0 or 1. The training version also fills missing Age and Fare values with the column median.

def process_training_data(X):
    X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})
    X['Age'] = X['Age'].fillna(X['Age'].median())
    X['Fare'] = X['Fare'].fillna(X['Fare'].median())

    return X


def process_test_data(X):
    X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})

    return X

X = process_training_data(X)
X.head(5)

	Pclass	Sex	Age	SibSp	Parch	Fare
0	3	0	22.0	1	0	7.2500
1	1	1	38.0	1	0	71.2833
2	3	1	26.0	0	0	7.9250
3	1	1	35.0	1	0	53.1000
4	3	0	35.0	0	0	8.0500
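One thing to watch: process_test_data only encodes Sex, so any missing Age or Fare values in the test set carry through as NaN. A possible alternative, sketched here on a toy DataFrame rather than the Titanic files, is to compute the medians on the training data and reuse them for the test data. The helper names fit_medians and apply_medians are hypothetical, not part of the notebook.

```python
import pandas as pd

def fit_medians(train, columns):
    """Compute per-column medians from the training data only."""
    return train[columns].median()

def apply_medians(frame, medians):
    """Return a new frame with NaNs replaced by the stored training medians."""
    # fillna with a Series fills each column using the value at the matching label
    return frame.fillna(medians)

# Toy data standing in for train.csv and test.csv
train = pd.DataFrame({"Age": [22.0, 38.0, None, 35.0], "Fare": [7.25, 71.28, 7.92, None]})
test = pd.DataFrame({"Age": [None, 47.0], "Fare": [8.05, None]})

medians = fit_medians(train, ["Age", "Fare"])
train_filled = apply_medians(train, medians)
test_filled = apply_medians(test, medians)
print(test_filled)
```

Reusing the training medians keeps the test set from influencing preprocessing, which mirrors how the scaler is fit on the training data and only applied to the test data.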

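We also need to scale the numeric features. Without scaling, training fails with

RuntimeError: all elements of input should be between 0 and 1

The “input” in this message is the input to BCELoss, meaning the model’s predictions rather than the features. The final Sigmoid keeps predictions between 0 and 1, so the most likely cause is that training on unscaled features (Fare runs into the hundreds) diverged and produced NaN outputs. We’ll scale both the training and test data with StandardScaler, per Sonnet’s recommendation. StandardScaler doesn’t constrain the data to be between 0 and 1 (it standardizes each column to zero mean and unit variance), but it gets the job done for the model architecture I selected.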

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

test_df = pd.read_csv('titanic/test.csv')
X_test = test_df[features].copy()
X_test = process_test_data(X_test)
X_test_scaled = scaler.transform(X_test)
y_test_df = pd.read_csv('titanic/gender_submission.csv')
y_test = y_test_df['Survived']
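To make the behavior of StandardScaler concrete, here is a plain-Python sketch of the same arithmetic on the five Fare values shown earlier. The standardize helper is hypothetical. It uses the population standard deviation, which is what StandardScaler uses.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

fares = [7.25, 71.2833, 7.925, 53.1, 8.05]
scaled = standardize(fares)

print([round(v, 3) for v in scaled])
print("min:", round(min(scaled), 3), "max:", round(max(scaled), 3))
```

The scaled values are centered on zero, with some below 0 and some above 1, so the result is a standardized column rather than one confined to the 0 to 1 range.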

Next, we turn these NumPy arrays into PyTorch tensors and define the model architecture: six input features, one hidden layer of eight units, and a sigmoid output.

X_train_tensor = torch.FloatTensor(X_scaled)
y_train_tensor = torch.FloatTensor(y.values)
X_test_tensor = torch.FloatTensor(X_test_scaled)
y_test_tensor = torch.FloatTensor(y_test.values)

model = nn.Sequential(
    nn.Linear(6, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()
)
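The final nn.Sigmoid is what turns the network’s raw output into a probability. As a rough illustration in plain Python (sigmoid here is a hypothetical helper, not a PyTorch call), any finite number is squashed into the range 0 to 1, while a NaN passes straight through:

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z is negative (or NaN) here, so exp cannot overflow
    return e / (1.0 + e)

for z in (-5.0, 0.0, 5.0, float("nan")):
    print(z, sigmoid(z))
```

This is why a NaN anywhere upstream (in the features or the weights) still shows up as a NaN prediction despite the sigmoid.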

Also, define a loss function and an optimizer:

criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
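For intuition about what the loss measures, here is a minimal plain-Python sketch of mean binary cross-entropy. It is not nn.BCELoss itself: the bce helper is hypothetical, and it is stricter than PyTorch in that it rejects predictions of exactly 0 or 1. The range check is written so that a NaN prediction fails it too.

```python
import math

def bce(preds, targets):
    """Mean binary cross-entropy over a batch of probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(preds, targets):
        # Any comparison with NaN is False, so NaN is rejected here as well
        if not (0.0 < p < 1.0):
            raise ValueError("prediction must be strictly between 0 and 1")
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(preds)

print(round(bce([0.9, 0.1], [1, 0]), 4))  # confident and correct: low loss
print(round(bce([0.1, 0.9], [1, 0]), 4))  # confident and wrong: high loss
```

Confident correct predictions give a loss near zero and confident wrong ones are penalized heavily, which is the signal SGD follows when it updates the weights.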

Finally, we can train the model. Sonnet wrote this code.

num_epochs = 1000
batch_size = 64

for epoch in range(num_epochs):
    for i in range(0, len(X_train_tensor), batch_size):
        batch_X = X_train_tensor[i:i+batch_size]
        batch_y = y_train_tensor[i:i+batch_size]

        outputs = model(batch_X)
        loss = criterion(outputs, batch_y.unsqueeze(1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

Epoch [100/1000], Loss: 0.3562
Epoch [200/1000], Loss: 0.3216
Epoch [300/1000], Loss: 0.3113
Epoch [400/1000], Loss: 0.3065
Epoch [500/1000], Loss: 0.3038
Epoch [600/1000], Loss: 0.3024
Epoch [700/1000], Loss: 0.2996
Epoch [800/1000], Loss: 0.2975
Epoch [900/1000], Loss: 0.2955
Epoch [1000/1000], Loss: 0.2937
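A note on how the loop above slices the data: range(0, len(X_train_tensor), batch_size) walks the rows in the same order every epoch, and the last slice is simply shorter. The sketch below (batch_bounds is a hypothetical helper) reproduces the slice boundaries for the 891 rows in train.csv.

```python
def batch_bounds(n_samples, batch_size):
    """Return (start, stop) index pairs matching range(0, n_samples, batch_size) slicing."""
    return [(start, min(start + batch_size, n_samples))
            for start in range(0, n_samples, batch_size)]

bounds = batch_bounds(891, 64)  # 891 training rows, batch_size from the loop above
sizes = [stop - start for start, stop in bounds]
print(len(bounds), sizes[0], sizes[-1])  # prints: 14 64 59
```

Since loss is overwritten on every batch, the value printed every 100 epochs is the loss on that final 59-row batch rather than an average over the epoch.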

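With the model trained, we can run inference on the test set and compare the results to the “Survived” column in gender_submission.csv. Kaggle does not publish the true labels for the test set. gender_submission.csv is a sample submission that predicts survival for every female passenger and no male passenger, so the number below measures agreement with that baseline rather than true accuracy.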

model.eval()
with torch.no_grad():
    y_pred = model(X_test_tensor)
    y_pred_class = (y_pred > 0.5).float()
    correct_predictions = (y_pred_class == y_test_tensor.unsqueeze(1)).sum().item()
    total_predictions = len(y_test_tensor)
    acc = correct_predictions / total_predictions
    print(f"Correct predictions: {correct_predictions} out of {total_predictions}")
    print(f"Accuracy: {acc:.2%}")

Correct predictions: 368 out of 418
Accuracy: 88.04%
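The thresholding step has one quiet failure mode worth knowing about: a NaN prediction is never greater than 0.5, so it is silently counted as class 0. A small plain-Python sketch with made-up probabilities and labels (accuracy is a hypothetical helper):

```python
def accuracy(probs, labels, threshold=0.5):
    """Threshold probabilities into 0/1 classes and count matches with the labels."""
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return preds, correct, correct / len(labels)

# Made-up values; the third prediction is NaN, as a row with missing features would produce
probs = [0.91, 0.12, float("nan"), 0.64]
labels = [1, 0, 1, 0]

preds, correct, acc = accuracy(probs, labels)
print(preds, correct, f"{acc:.2%}")
```

In this notebook that matters because the test rows with missing Age or Fare reach the model as NaN, so they are all predicted as not surviving.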

Now, let’s build what I think is a similar model with fastai primitives. Load the data again to avoid any unintentional contamination.

train_df = pd.read_csv('titanic/train.csv')
test_df = pd.read_csv('titanic/test.csv')

fastai’s TabularDataLoaders needs the following configuration to create DataLoaders:

cat_names: the names of the categorical variables
cont_names: the names of the continuous variables
y_names: the names of the dependent variables

cat_names = ['Pclass', 'Sex']
cont_names = ['Age', 'SibSp', 'Parch', 'Fare']
dep_var = 'Survived'

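Following a pattern similar to the one used in chapter 1, we train the model. One caveat: Survived is a numeric column and no y_block is passed, so fastai infers a regression target and trains with mean squared error rather than as a classifier. The accuracy metric expects class scores, which is why the accuracy column below stays at 0.662921 for every epoch.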

procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_df(
    train_df,
    path='.',
    procs=procs,
    cat_names=cat_names,
    cont_names=cont_names,
    y_names=dep_var,
    valid_pct=0.2,
    seed=42,
    bs=64,
)

learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(5, 1e-2)

/Users/danielcorin/dev/lab/fastbook_projects/sgd_titanic/.venv/lib/python3.12/site-packages/fastai/tabular/core.py:314: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  to[n].fillna(self.na_dict[n], inplace=True)

epoch	train_loss	valid_loss	accuracy	time
0	0.486258	0.233690	0.662921	00:02
1	0.378460	0.192642	0.662921	00:00
2	0.294309	0.132269	0.662921	00:00
3	0.248516	0.140377	0.662921	00:00
4	0.220335	0.132353	0.662921	00:00

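FillMissing only learns fill values for columns that have missing values in the training data. Fare has none in train.csv but one in test.csv, so learn.dls.test_dl does not fill the ‘Fare’ column of the test data and we do that manually here.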

test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())

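We run the test set through the model, then compare the predictions to the “Survived” column in gender_submission.csv and calculate the accuracy. As with the first model, this measures agreement with Kaggle’s sample submission, since the true test labels are not published.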

test_dl = learn.dls.test_dl(test_df)
preds, _ = learn.get_preds(dl=test_dl)

binary_preds = (preds > 0.5).float()

y_test = pd.read_csv('titanic/gender_submission.csv')
correct_predictions = (binary_preds.numpy().flatten() == y_test['Survived']).sum()
total_predictions = len(y_test)

acc = correct_predictions / total_predictions

print(f"Correct predictions: {correct_predictions} out of {total_predictions}")
print(f"Accuracy: {acc:.2%}")

/Users/danielcorin/dev/lab/fastbook_projects/sgd_titanic/.venv/lib/python3.12/site-packages/fastai/tabular/core.py:314: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  to[n].fillna(self.na_dict[n], inplace=True)

Correct predictions: 377 out of 418
Accuracy: 90.19%

The accuracies of the two models are about the same (88.04% and 90.19%)! For a first pass at training neural networks (with plenty of help from Sonnet), I think this went pretty well. If you know things about deep learning, let me know if I made any major mistakes. It’s a bit tough to know if you’re doing things correctly in isolation. I suppose that’s why Kaggle competitions can be useful for learning.
