Having completed lesson 5 of the FastAI course, I prompted Claude to give me some good datasets upon which to train a random forest model. This housing dataset from Kaggle seemed like a nice option, so I decided to give it a try. I am also going to try something that Jeremy Howard recommended for this notebook/post, which is to not refine or edit my process very much. I am mostly going to try things out and if they don’t work, I’ll try and write up why and continue, rather than finding a working path and adding commentary at the end.

To start, let’s turn off warnings (for a cleaner notebook) and get the dataset from Kaggle.

import warnings
warnings.simplefilter('ignore', FutureWarning)
!pip install kagglehub
!pip install --upgrade pip
!pip install fastai scikit-learn
import kagglehub

path = kagglehub.dataset_download("emurphy/ames-iowa-housing-prices-dataset")

print("Path to dataset files:", path)
Path to dataset files: /Users/danielcorin/.cache/kagglehub/datasets/emurphy/ames-iowa-housing-prices-dataset/versions/1

Look at the data#

We’re going to look at the data first to try and understand it a bit better.

import pandas as pd

df = pd.read_csv(str(path) + '/train1.csv', low_memory=False)
df.columns
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')

Lots of categorical variables. Here are a few examples:

df['Fence'].unique()
array([nan, 'MnPrv', 'GdWo', 'GdPrv', 'MnWw'], dtype=object)
df['LandSlope'].unique()
array(['Gtl', 'Mod', 'Sev'], dtype=object)
df['HouseStyle'].unique()
array(['2Story', '1Story', '1.5Fin', '1.5Unf', 'SFoyer', 'SLvl', '2.5Unf',
       '2.5Fin'], dtype=object)

But there are also several continuous variables:

df['GarageArea'].hist()
<Axes: >

[Histogram of GarageArea]

df['GrLivArea'].hist()
<Axes: >

[Histogram of GrLivArea]

df[['1stFlrSF', '2ndFlrSF']].hist()
array([[<Axes: title={'center': '1stFlrSF'}>,
        <Axes: title={'center': '2ndFlrSF'}>]], dtype=object)

[Histograms of 1stFlrSF and 2ndFlrSF]

Since I’m not familiar with other approaches yet, I am going to use root mean squared log error (RMSLE) because that is what the Chapter 9 notebook uses: take the log of the sale price, then measure plain RMSE on the log values. If this approach doesn’t produce an effective model, I’ll investigate other approaches.
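
As a sanity check on that choice, note that taking the log of the sale price and then measuring plain RMSE is the same as an RMSLE-style metric on the raw prices, since log(a) - log(b) = log(a/b). Here is a minimal sketch with made-up prices (scikit-learn's mean_squared_log_error uses log(1 + y), so it would differ very slightly):

import numpy as np

# Made-up prices purely for illustration.
prices_true = np.array([200_000, 150_000, 300_000], dtype=float)
prices_pred = np.array([210_000, 140_000, 310_000], dtype=float)

# RMSE computed on the log-prices...
rmse_of_logs = np.sqrt(np.mean((np.log(prices_pred) - np.log(prices_true)) ** 2))
# ...matches the root mean squared log-ratio of the raw prices.
rmsle_style = np.sqrt(np.mean(np.log(prices_pred / prices_true) ** 2))

print(rmse_of_logs, rmsle_style)  # identical values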

dep_var = 'SalePrice'
df[dep_var]
0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64
import numpy as np

df[dep_var] = np.log(df[dep_var])
df[dep_var]
0       12.247694
1       12.109011
2       12.317167
3       11.849398
4       12.429216
          ...    
1455    12.072541
1456    12.254863
1457    12.493130
1458    11.864462
1459    11.901583
Name: SalePrice, Length: 1460, dtype: float64

Since we’re aiming to predict home sale prices in the future, we should split our training and validation data by date. Let’s look at the range of the training set to figure out how it would make sense to do that.

df[['YrSold', 'MoSold']].hist()
print(f"Year range: {df['YrSold'].min()} - {df['YrSold'].max()}")
Year range: 2006 - 2010

[Histograms of YrSold and MoSold]

Create training and validation sets#

I asked Claude how I might split these into training and validation sets (we’ll see if this is helpful or even a good idea, I guess). It recommended a 75%-25% split between training and validation, so let’s loosely do that.

cond = ((df['YrSold'] < 2009) | ((df['YrSold'] == 2009) & (df['MoSold'] >= 8))).values
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]

splits = (list(train_idx),list(valid_idx))
print(f"Training set size: {len(train_idx)} ({len(train_idx)/len(df):.1%})")
print(f"Validation set size: {len(valid_idx)} ({len(valid_idx)/len(df):.1%})")
Training set size: 1061 (72.7%)
Validation set size: 399 (27.3%)
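
As an extra sanity check (my own addition, not something from the chapter), we can peek at which year and month ranges land in each split:

# Year/month ranges present in the training and validation splits.
print(df.iloc[train_idx].groupby('YrSold')['MoSold'].agg(['min', 'max']))
print(df.iloc[valid_idx].groupby('YrSold')['MoSold'].agg(['min', 'max']))
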
from fastai.tabular.all import *

cont, cat = cont_cat_split(df, 1, dep_var=dep_var)
procs = [Categorify, FillMissing]
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)

After this data pre-processing, we validate the split is still approximately 75%-25%.

print(f"Training: {len(to.train)} ({len(to.train)/len(df):.1%})")
print(f"Validation: {len(to.valid)} ({len(to.valid)/len(df):.1%})")
Training: 1061 (72.7%)
Validation: 399 (27.3%)

Create a decision tree#

Following a similar approach to chapter 9, we’re going to create a simple decision tree with a maximum of 4 leaf nodes so we can learn a bit about which features influence house sale prices the most.

xs, y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y
from sklearn.tree import DecisionTreeRegressor

m0 = DecisionTreeRegressor(max_leaf_nodes=4)
m0.fit(xs, y);
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(40,20))
plot_tree(m0, feature_names=xs.columns, filled=True, rounded=True, precision=2, fontsize=30)
plt.show()

[Decision tree with a maximum of 4 leaf nodes]

The squared_error values look a bit small, but that’s likely because we took the log of the sale prices.
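
To get a feel for what an error in log-price space means (my own quick check), we can exponentiate a residual back into a multiplicative error on the raw sale price:

import numpy as np

# A residual of 0.1 in log-price corresponds to roughly a 10% error in price.
print(np.exp(0.1) - 1)  # ~0.105, i.e. about a 10.5% difference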

!pip install dtreeviz
import dtreeviz
warnings.filterwarnings('ignore', 'X does not have valid feature names')


samp_idx = np.random.permutation(len(y))[:500]
viz_model = dtreeviz.model(
    m0,
    X_train=xs.iloc[samp_idx],
    y_train=y.iloc[samp_idx],
    feature_names=xs.columns,
    target_name=dep_var,
)

viz_model.view(
    fontname='DejaVu Sans',
    label_fontsize=10,
    orientation='LR',
)
Decision tree visualization

Let’s do the same experimentation as in Chapter 9. We’ll build a bigger tree with no stopping criteria.

m1 = DecisionTreeRegressor()
m1.fit(xs, y);
import math

def r_mse(pred,y): return round(math.sqrt(((pred-y)**2).mean()), 6)
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)
m_rmse(m1, xs, y)
0.0

A “perfect” model fit on the train data – suspicious.

m_rmse(m1, valid_xs, valid_y)
0.223525

A much poorer fit on the validation set, by comparison.

m1.get_n_leaves(), len(xs)
(np.int64(1019), 1061)

We’re seeing almost as many leaves as training data points. Let’s try and build a smaller tree.

m2 = DecisionTreeRegressor(min_samples_leaf=25)
m2.fit(to.train.xs, to.train.y)
m_rmse(m2, xs, y), m_rmse(m2, valid_xs, valid_y)
(0.166169, 0.192442)

The root mean square error is a bit improved.

m2.get_n_leaves()
np.int64(33)

And with far fewer leaves! Can a random forest improve our error rate further?

from sklearn.ensemble import RandomForestRegressor

def rf(xs, y, n_estimators, max_samples,
       max_features, min_samples_leaf, **kwargs):
    return RandomForestRegressor(
        n_jobs=-1,
        n_estimators=n_estimators,
        max_samples=max_samples,
        max_features=max_features,
        min_samples_leaf=min_samples_leaf,
        oob_score=True,
        **kwargs,
    ).fit(xs, y)

Let’s inspect xs to figure out reasonable values for n_estimators, max_samples, max_features and min_samples_leaf.

print(f"n_samples: {len(xs)}")
print(f"n_features: {xs.shape[1]}")
n_samples: 1061
n_features: 83

Given these, I asked Claude for some reasonable values for these hyperparameters.

n_estimators=100: Sweet spot between performance and training time

max_samples=33 (sqrt(1061))

max_features=9 (sqrt(83))

min_samples_leaf=11 (~1% of 1061)

I briefly looked to corroborate these suggestions in papers, but wasn’t able to quickly do so. I recognize these recommendations may be bad or wrong, but for learning purposes, I am going to proceed and we will see how it goes.
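
For reference, here is the back-of-the-envelope arithmetic behind those numbers. This just mirrors the suggestions above; it is not a validated tuning strategy.

n_samples, n_features = xs.shape  # (1061, 83)

suggested = dict(
    n_estimators=100,                          # fixed suggestion
    max_samples=round(n_samples ** 0.5),       # sqrt(1061) ~ 33
    max_features=round(n_features ** 0.5),     # sqrt(83) ~ 9
    min_samples_leaf=round(0.01 * n_samples),  # ~1% of 1061 ~ 11
)
print(suggested)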

m3 = rf(xs, y, n_estimators=100,
        max_samples=33, max_features=9,
        min_samples_leaf=11)
m_rmse(m3, xs, y), m_rmse(m3, valid_xs, valid_y)
(0.279821, 0.302108)

Interesting. Worse results than our trees.

As the chapter does, let’s plot r_mse as the number of trees used for prediction varies, up to 125, to see if there is anything we can do to improve (m3 only has 100 trees, so the curve flattens past that point).

preds = np.stack([t.predict(valid_xs) for t in m3.estimators_])
plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(125)]);

[Validation RMSE as the number of trees increases]

I am going to choose n_estimators=20 because that looks close to the minimum.

m4 = rf(xs, y, n_estimators=20,
        max_samples=33, max_features=9,
        min_samples_leaf=11)
m_rmse(m4, xs, y), m_rmse(m4, valid_xs, valid_y)
(0.280595, 0.303637)

This hyperparameter change doesn’t seem to improve the random forest model too much though. Let’s plot feature importance to see if we can learn more.

def rf_feat_importance(m, df):
    return pd.DataFrame(
        {'cols':df.columns, 'imp':m.feature_importances_}
    ).sort_values('imp', ascending=False)
fi = rf_feat_importance(m1, xs)
fi[:10]

            cols       imp
50   OverallQual  0.550081
62     GrLivArea  0.150868
58   TotalBsmtSF  0.081219
72    GarageCars  0.018286
52     YearBuilt  0.015654
57     BsmtUnfSF  0.013133
59      1stFlrSF  0.011235
55    BsmtFinSF1  0.011091
60      2ndFlrSF  0.010260
8   Neighborhood  0.009380
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

plot_fi(fi[:30]);

[Feature importance plot of the top 30 features]

It’s interesting to see OverallQual, which I would guess is a fairly subjective measure, show up as the most impactful feature.
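
Out of curiosity (my own quick check, not from the chapter), here is how the median log sale price moves with OverallQual, which the dataset documentation describes as a 1-10 rating of the home's overall material and finish:

# Median log sale price at each OverallQual level (SalePrice is already log-transformed).
df.groupby('OverallQual')[dep_var].median().plot(marker='o');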

Following the chapter, let’s remove the lower importance features and retrain the model with the pruned feature list.

to_keep = fi[fi.imp>0.005].cols
len(to_keep)
17
xs_imp = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]
m5 = rf(xs_imp, y, n_estimators=15,
        max_samples=33, max_features=9,
        min_samples_leaf=11)

Now let’s see how the model trained on only the (supposedly) 17 most important features performs.

m_rmse(m5, xs_imp, y), m_rmse(m5, valid_xs_imp, valid_y)
(0.260041, 0.280743)

This turns out to be a bit of an improvement with many fewer columns.

Let’s look at the feature importances again and check for potentially redundant features.

len(xs.columns), len(xs_imp.columns)
(83, 17)
plot_fi(rf_feat_importance(m5, xs_imp));

[Feature importance plot for the pruned feature set]

It looks like maybe we could remove even more features? I didn’t expect to see so many features with such low importance.
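
One quick way to see which of the remaining features sit at the bottom of the ranking (my own sketch; the 0.01 cutoff is arbitrary):

fi_imp = rf_feat_importance(m5, xs_imp)
fi_imp[fi_imp.imp < 0.01]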

import scipy.stats
from scipy.cluster import hierarchy as hc

def cluster_columns(df, figsize=(10,6), font_size=12):
    corr = np.round(scipy.stats.spearmanr(df).correlation, 4)
    corr_condensed = hc.distance.squareform(1-corr)
    z = hc.linkage(corr_condensed, method='average')
    fig = plt.figure(figsize=figsize)
    hc.dendrogram(z, labels=df.columns, orientation='left', leaf_font_size=font_size)
    plt.show()

cluster_columns(xs_imp)

[Dendrogram of hierarchically clustered features]

Nothing seems obviously duplicative. Let’s drop the remaining features with low importance seen in the feature importance plot, then train another model.

to_drop = [
    'Fence', 'BsmtFinSF1', 'Neighborhood', 'Alley', 'GarageType',
    '2ndFlrSF', 'OverallCond', 'Electrical', 'PavedDrive',
    'LotArea', 'BsmtUnfSF',
]
xs_final = xs_imp.drop(to_drop, axis=1)
valid_xs_final = valid_xs_imp.drop(to_drop, axis=1)
m6 = rf(xs_final, y, n_estimators=15,
        max_samples=33, max_features=9,
        min_samples_leaf=11)
m_rmse(m6, xs_final, y), m_rmse(m6, valid_xs_final, valid_y)
(0.254607, 0.271229)

Dropping these does not seem to make the model much worse.
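
Since the rf() helper sets oob_score=True, we can also look at the out-of-bag error for m6 as another rough check, the same idea as the OOB check in Chapter 9:

# RMSE of the out-of-bag predictions against the training targets.
r_mse(m6.oob_prediction_, y)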

Run model on the test set#

Let’s load the test set and run our model to see if we’re generally building in the right direction.

test_df = pd.read_csv(str(path) + '/test1.csv', low_memory=False)
test_df[dep_var] = np.log(test_df[dep_var])
procs = [Categorify, FillMissing]
test_cont, test_cat = cont_cat_split(test_df, 1, dep_var=dep_var)
test_to = TabularPandas(test_df, procs, test_cat, test_cont, splits=None, y_names=None)

to_keep = xs_final.columns
test_xs = test_to.train.xs[to_keep]
preds = m6.predict(test_xs)
r_mse(preds, test_df[dep_var]), len(to_keep)
(0.22978, 6)

I think this result looks promising. The error is a bit better than what we’ve seen during training.

I had some concerns about removing so many of the features, but it does seem to have improved the model at each turn, at least in training. We can run a few of the older models against the test set to validate that we actually made improvements.

Comparing to past models#

On a whim, I decided to run the test set through all the models that I trained over the course of this notebook.

First, we run our random forest model after the first round of dropped features.

to_keep = xs_imp.columns
test_xs = test_to.train.xs[to_keep]
preds = m5.predict(test_xs)
r_mse(preds, test_df[dep_var]), len(to_keep)
(0.190995, 17)

Next, we run our random forest with the tuned n_estimators hyperparameter.

to_keep = xs.columns
test_xs = test_to.train.xs[to_keep]
preds = m4.predict(test_xs)
r_mse(preds, test_df[dep_var]), len(to_keep)
(0.175185, 83)

After that, we run our first random forest.

to_keep = xs.columns
test_xs = test_to.train.xs[to_keep]
preds = m3.predict(test_xs)
r_mse(preds, test_df[dep_var]), len(to_keep)
(0.175589, 83)

And finally, we run our first three decision tree models, which varied in their stopping criteria.

to_keep = xs.columns
test_xs = test_to.train.xs[to_keep]
preds = m2.predict(test_xs)
r_mse(preds, test_df[dep_var]), len(to_keep)
(0.359379, 83)
to_keep = xs.columns
test_xs = test_to.train.xs[to_keep]
preds = m1.predict(test_xs)
r_mse(preds, test_df[dep_var]), len(to_keep)
(0.412504, 83)
to_keep = xs.columns
test_xs = test_to.train.xs[to_keep]
preds = m0.predict(test_xs)
r_mse(preds, test_df[dep_var]), len(to_keep)
(0.315114, 83)
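
To make the comparison easier to scan, here is a small summary table of the test-set runs above (the values are copied from the outputs, so this is purely a convenience view):

results = pd.DataFrame([
    ('m0', 'decision tree, max_leaf_nodes=4',     83, 0.315114),
    ('m1', 'decision tree, no stopping criteria', 83, 0.412504),
    ('m2', 'decision tree, min_samples_leaf=25',  83, 0.359379),
    ('m3', 'random forest, 100 trees',            83, 0.175589),
    ('m4', 'random forest, 20 trees',             83, 0.175185),
    ('m5', 'random forest, 17 features',          17, 0.190995),
    ('m6', 'random forest, 6 features',            6, 0.229780),
], columns=['model', 'description', 'n_features', 'test_rmse'])
print(results.sort_values('test_rmse'))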

It looks like the fourth model we trained (m3, our first random forest), which still considers all 83 of the original columns, actually performs a bit better on the test set than our seventh model (m6), where we pruned many columns that didn’t seem to have much importance. However, given the number of columns we managed to prune, maybe it could still make sense to use the model with fewer features – I’m not quite sure. Maybe we could learn more by monitoring the models and retraining with new data in the future.

Takeaways#

I’m quite glad I kept a running log in the notebook as I experimented. It made it very easy to go back and check my test set against previous model iterations.

So much experimentation is involved in building models, and notebooks are a very useful tool for empowering that experimentation. I’m not sure how I could do ML without them.

Claude continues to be an invaluable assistant for answering questions about my approach and writing little snippets of pandas to help me validate I am (hopefully) on the right track.

I’m still not really sure what my results mean! My model seemed to perform reasonably well against the test set. This (again) is where a Kaggle competition could help with a leaderboard so I could have some relative sense of how good my approach is. Right now, I just don’t know. I’ve been told this feeling may never go away 🫠.