In this notebook, we’ll use the MovieLens 10M dataset and collaborative filtering to create a movie recommendation model. We’ll use the data from movies.dat and ratings.dat to create embeddings that will help us predict ratings for movies I haven’t watched yet.

Create some personal data#

Before writing any code to train models, I code-generated a quick UI for rating movies. It produces my_ratings.dat, which I later append to ratings.dat. There's a bit of code needed for that, but the nice part is that with inline script metadata and uv, we can write (generate) and run the whole tool as a single file.

Here is the code:

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "fastapi",
#     "pandas",
#     "uvicorn",
# ]
# ///

from fastapi import FastAPI
from fastapi.responses import JSONResponse, HTMLResponse
import pandas as pd
from datetime import datetime
import uvicorn

app = FastAPI()

movies_df = pd.read_csv(
    "ml-10M100K/movies.dat",
    sep="::",
    names=["movie_id", "title", "genres"],
    engine="python",
)
movies_df["year"] = movies_df["title"].str.extract(r"\((\d{4})\)")
movies_df["title"] = movies_df["title"].str.replace(r"\s*\(\d{4}\)", "", regex=True)
movies_df = movies_df.sort_values("year", ascending=False)

# Restore progress from disk so restarting the tool resumes where we left off
try:
    with open("last_rated.txt") as f:
        last_rated_index = int(f.read().strip())
except (FileNotFoundError, ValueError):
    last_rated_index = 0


@app.get("/", response_class=HTMLResponse)
async def get_root():
    return """
    <!DOCTYPE html>
    <html>
    <head>
        <title>Movie Ratings</title>
        <style>
            body { font-family: sans-serif; max-width: 800px; margin: 0 auto; padding: 20px; }
            .movie { margin-bottom: 20px; padding: 20px; border: 1px solid #ccc; border-radius: 5px; }
            .rating-buttons { margin-top: 10px; }
            button { margin-right: 5px; padding: 5px 10px; cursor: pointer; }
            .rating-btn { background: #4CAF50; color: white; border: none; }
            .skip-btn { background: #f44336; color: white; border: none; }
        </style>
    </head>
    <body>
        <div id="current-movie" class="movie">
            <h2 id="movie-title"></h2>
            <p>Year: <span id="movie-year"></span></p>
            <p>Genres: <span id="movie-genres"></span></p>
            <div class="rating-buttons">
                <button class="rating-btn" onclick="rateMovie(1)">1★</button>
                <button class="rating-btn" onclick="rateMovie(2)">2★</button>
                <button class="rating-btn" onclick="rateMovie(3)">3★</button>
                <button class="rating-btn" onclick="rateMovie(4)">4★</button>
                <button class="rating-btn" onclick="rateMovie(5)">5★</button>
                <button class="skip-btn" onclick="skipMovie()">Skip</button>
            </div>
        </div>

        <script>
            let currentMovie = null;

            async function loadNextMovie() {
                const response = await fetch('/next-movie');
                currentMovie = await response.json();
                document.getElementById('movie-title').textContent = currentMovie.title;
                document.getElementById('movie-year').textContent = currentMovie.year;
                document.getElementById('movie-genres').textContent = currentMovie.genres;
            }

            async function rateMovie(rating) {
                if (!currentMovie) return;
                await fetch(`/rate-movie/${currentMovie.movie_id}/${rating}`, {
                    method: 'POST'
                });
                loadNextMovie();
            }

            async function skipMovie() {
                if (!currentMovie) return;
                await fetch(`/skip-movie/${currentMovie.movie_id}`, {
                    method: 'POST'
                });
                loadNextMovie();
            }

            loadNextMovie();
        </script>
    </body>
    </html>
    """


@app.get("/next-movie")
async def get_next_movie():
    global last_rated_index
    if last_rated_index >= len(movies_df):
        return JSONResponse({"error": "No more movies to rate"}, status_code=404)
    movie = movies_df.iloc[last_rated_index].to_dict()
    return JSONResponse(movie)


@app.post("/rate-movie/{movie_id}/{rating}")
async def rate_movie(movie_id: int, rating: int):
    global last_rated_index
    if rating not in range(1, 6):
        return JSONResponse(
            {"error": "Rating must be between 1 and 5"}, status_code=400
        )

    timestamp = int(datetime.now().timestamp())
    user_id = 99999

    with open("my_ratings.dat", "a") as f:
        f.write(f"{user_id}::{movie_id}::{rating}::{timestamp}\n")

    last_rated_index += 1
    with open("last_rated.txt", "w") as f:
        f.write(str(last_rated_index))

    return JSONResponse({"status": "success"})


@app.post("/skip-movie/{movie_id}")
async def skip_movie(movie_id: int):
    global last_rated_index
    last_rated_index += 1
    with open("last_rated.txt", "w") as f:
        f.write(str(last_rated_index))
    return JSONResponse({"status": "success"})


if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)

which can be run with

uv run app.py

When run, the tool looks like this.

Screenshot of movie rating tool

Load the data#

With around 40 movies rated and saved in my_ratings.dat, let's install fastai, suppress warnings to keep the notebook output clean, and import the libraries we'll need to train the model.

!pip install fastai
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import matplotlib.pyplot as plt
from fastai.collab import *
from fastai.tabular.all import *

user_id = 99999

Looking at the README for the dataset, we see it has the following structure

MovieID::Title::Genres 

We can import that as a CSV, using "::" as the separator (the multi-character separator requires pandas' Python parsing engine):

movies = pd.read_csv('ml-10M100K/movies.dat', sep='::', names=['id', 'name', 'genre'], engine='python')
movies['year'] = movies['name'].str.extract(r'\((\d{4})\)')
movies.head()

   id  name                                genre                                        year
0  1   Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy  1995
1  2   Jumanji (1995)                      Adventure|Children|Fantasy                   1995
2  3   Grumpier Old Men (1995)             Comedy|Romance                               1995
3  4   Waiting to Exhale (1995)            Comedy|Drama|Romance                         1995
4  5   Father of the Bride Part II (1995)  Comedy                                       1995
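Under the hood, each record in movies.dat is just a "::"-separated line. A minimal stdlib sketch of the same parse (the helper name is mine, not part of the notebook):

```python
def parse_movie_line(line: str) -> dict:
    """Split one movies.dat record of the form MovieID::Title::Genres."""
    movie_id, title, genres = line.rstrip("\n").split("::")
    return {"id": int(movie_id), "name": title, "genre": genres}

row = parse_movie_line("1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy")
print(row["name"])  # Toy Story (1995)
```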

Next, we load the ratings from the dataset and concatenate them with the ratings I created so that I could generate predictions for myself (user id 99999).

ratings = pd.concat([
    pd.read_csv('ml-10M100K/ratings.dat', sep='::', names=['userId', 'movieId', 'rating', 'timestamp'], engine='python'),
    pd.read_csv('ml-10M100K/my_ratings.dat', sep='::', names=['userId', 'movieId', 'rating', 'timestamp'], engine='python')
])
ratings.tail()

    userId  movieId  rating  timestamp
35  99999   46578    3.0     1734831045
36  99999   44191    5.0     1734831168
37  99999   40815    4.0     1734831310
38  99999   30793    3.0     1734831332
39  99999   35836    4.0     1734831347

With the ratings loaded, we can check their distribution to validate that the ratings are diverse enough to make a good training dataset.

ratings['rating'].hist(bins=20, figsize=(10,6))
plt.title('Distribution of Movie Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

(histogram showing the distribution of movie ratings)

Train some models#

Now we'll lean heavily on fastai to create and train a collaborative learner from the ratings data, training it for five epochs in total (three now, then two more after saving a checkpoint). This process will likely take some time.

dls = CollabDataLoaders.from_df(
    ratings,
    user_name='userId',
    item_name='movieId',
    rating_name='rating',
    bs=256
)
dls.show_batch()
(show_batch prints a random sample of ten userId / movieId / rating rows from the training set)
learner = collab_learner(
    dls,
    n_factors=20,
    y_range=(0.5, 5.5)
)

learner.fit_one_cycle(3)
epoch  train_loss  valid_loss  time
0      0.717291    0.733708    02:19
1      0.652198    0.687307    02:20
2      0.632653    0.676981    02:23

Let’s save a checkpoint of the model, then reload it and run two more epochs.

learner.save('collab_model_20_factors_256_bs')
Path('models/collab_model_20_factors_256_bs.pth')
learner = learner.load('collab_model_20_factors_256_bs')
learner.fit_one_cycle(2)
epoch  train_loss  valid_loss  time
0      0.602660    0.670399    02:21
1      0.586995    0.658351    02:20

It looks like the loss doesn’t improve too much with the additional two training epochs, which is good to know for future model training.

I’m intentionally keeping the training time relatively fast at this point. I want to be able to get a feel for how training these types of models works. Once I get a better sense of that, I’ll increase things like n_factors and training epochs.

Now that we've trained the model, let's get movie recommendations for me (user 99999). To do this, we'll predict ratings for all movies and sort them by highest predicted rating; these are the ratings the model thinks I'll give, based on my rating history.

def get_preds(learner, user_id=99999, num_recs=20):
    all_movies = pd.DataFrame({'userId': [user_id] * len(movies), 'movieId': movies['id']})
    preds = learner.get_preds(dl=learner.dls.test_dl(all_movies))[0].numpy()

    recommendations = pd.DataFrame({
        'movie_id': movies['id'],
        'title': movies['name'],
        'year': movies['year'],
        'predicted_rating': preds
    })

    return recommendations.sort_values('predicted_rating', ascending=False).head(num_recs)

recommendations = get_preds(learner)
recommendations

       movie_id  title  year  predicted_rating
293    296    Pulp Fiction (1994)  1994  4.727180
49     50     Usual Suspects, The (1995)  1995  4.673663
315    318    Shawshank Redemption, The (1994)  1994  4.528549
843    858    Godfather, The (1972)  1972  4.521612
2487   2571   Matrix, The (1999)  1999  4.423033
4134   4226   Memento (2000)  2000  4.410904
1173   1198   Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)  1981  4.393826
4899   4993   Lord of the Rings: The Fellowship of the Ring, The (2001)  2001  4.388923
2874   2959   Fight Club (1999)  1999  4.380060
10216  58559  Dark Knight, The (2008)  2008  4.371044
7039   7153   Lord of the Rings: The Return of the King, The (2003)  2003  4.354837
1195   1221   Godfather: Part II, The (1974)  1974  4.352811
1067   1089   Reservoir Dogs (1992)  1992  4.338005
5852   5952   Lord of the Rings: The Two Towers, The (2002)  2002  4.321620
1113   1136   Monty Python and the Holy Grail (1975)  1975  4.311879
1171   1196   Star Wars: Episode V - The Empire Strikes Back (1980)  1980  4.311409
257    260    Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)  1977  4.301165
523    527    Schindler's List (1993)  1993  4.293892
5916   6016   City of God (Cidade de Deus) (2002)  2002  4.267132
1187   1213   Goodfellas (1990)  1990  4.253361

These recommendations seem pretty good. I've actually seen and liked some of the movies the model is recommending. But something I didn't expect is also happening: the model is generating predictions for movies I've already rated. Movie id 58559 is already in my_ratings.dat:

ratings[(ratings['userId'] == 99999) & (ratings['movieId'] == 58559)]

   userId  movieId  rating  timestamp
0  99999   58559    5.0     1734829873
recommendations[recommendations['movie_id'] == 58559]

       movie_id  title  year  predicted_rating
10216  58559  Dark Knight, The (2008)  2008  4.371044

Let’s modify get_preds to filter these duplicates out

def get_preds(learner, ratings, user_id=99999, num_recs=20):
    all_movies = pd.DataFrame({'userId': [user_id] * len(movies), 'movieId': movies['id']})
    preds = learner.get_preds(dl=learner.dls.test_dl(all_movies))[0].numpy()

    recommendations = pd.DataFrame({
        'movie_id': movies['id'],
        'title': movies['name'],
        'year': movies['year'],
        'predicted_rating': preds
    })

    rated_movies = ratings[ratings['userId'] == user_id]['movieId'].values

    recommendations = recommendations[~recommendations['movie_id'].isin(rated_movies)]

    return recommendations.sort_values('predicted_rating', ascending=False).head(num_recs)

recommendations = get_preds(learner, ratings)
recommendations

       movie_id  title  year  predicted_rating
293    296    Pulp Fiction (1994)  1994  4.727180
49     50     Usual Suspects, The (1995)  1995  4.673663
315    318    Shawshank Redemption, The (1994)  1994  4.528549
843    858    Godfather, The (1972)  1972  4.521612
2487   2571   Matrix, The (1999)  1999  4.423033
4134   4226   Memento (2000)  2000  4.410904
1173   1198   Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)  1981  4.393826
4899   4993   Lord of the Rings: The Fellowship of the Ring, The (2001)  2001  4.388923
2874   2959   Fight Club (1999)  1999  4.380060
7039   7153   Lord of the Rings: The Return of the King, The (2003)  2003  4.354837
1195   1221   Godfather: Part II, The (1974)  1974  4.352811
1067   1089   Reservoir Dogs (1992)  1992  4.338005
5852   5952   Lord of the Rings: The Two Towers, The (2002)  2002  4.321620
1113   1136   Monty Python and the Holy Grail (1975)  1975  4.311879
1171   1196   Star Wars: Episode V - The Empire Strikes Back (1980)  1980  4.311409
257    260    Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)  1977  4.301165
523    527    Schindler's List (1993)  1993  4.293892
5916   6016   City of God (Cidade de Deus) (2002)  2002  4.267132
1187   1213   Goodfellas (1990)  1990  4.253361
1172   1197   Princess Bride, The (1987)  1987  4.251543

Now we’re getting clean predictions.

Since I’ve seen some of these movies (but haven’t added ratings for them yet), it would be nice to do that so I can generate new recommendations with additional data. Given the way collaborative filtering works, we’d need to retrain the model with the augmented ratings dataset.

I could just filter out the IDs of the recommendations I’ve already watched and work through the existing recommendations/predictions list. However, these would still only take into account the original ratings I trained the model on for my user, which means we’re not making great use of the data.

It would be nice to retrain the model on this augmented data, simulating what real system retraining could look like.
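One wrinkle to guard against when simulating retraining this way: if I re-rate a movie that is already in my ratings, a plain concat keeps both rows and the model trains on conflicting targets. Here is a small sketch of a de-duplicating merge (the merge_ratings helper is my own, not from the notebook), which keeps only the most recent rating per (userId, movieId) pair using the timestamp column:

```python
import pandas as pd

def merge_ratings(old: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Concatenate ratings, keeping only the most recent rating per (userId, movieId)."""
    combined = pd.concat([old, new], ignore_index=True)
    combined = combined.sort_values("timestamp", kind="stable")
    return combined.drop_duplicates(["userId", "movieId"], keep="last").reset_index(drop=True)

old = pd.DataFrame({"userId": [99999], "movieId": [318], "rating": [3.0], "timestamp": [1]})
new = pd.DataFrame({"userId": [99999], "movieId": [318], "rating": [4.5], "timestamp": [2]})
merged = merge_ratings(old, new)
print(merged)  # a single row with the updated 4.5 rating
```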

import time

def add_new_ratings(user_id, new_ratings):
    new_ratings_df = pd.DataFrame({
        'userId': [user_id] * len(new_ratings),
        'movieId': [x[0] for x in new_ratings],
        'rating': [x[1] for x in new_ratings],
        'timestamp': [int(time.time())] * len(new_ratings)
    })
    return new_ratings_df

new_ratings = [
    (318, 4.5), # Shawshank Redemption
    (50, 4), # The Usual Suspects
    (4226, 4.5), # Memento
]

new_ratings_df = add_new_ratings(user_id, new_ratings)
new_ratings_df

   userId  movieId  rating  timestamp
0  99999   318      4.5     1735055101
1  99999   50       4.0     1735055101
2  99999   4226     4.5     1735055101
ratings2 = pd.concat([ratings, new_ratings_df], ignore_index=True)
ratings2.tail(5)

          userId  movieId  rating  timestamp
10000092  99999   30793    3.0     1734831332
10000093  99999   35836    4.0     1734831347
10000094  99999   318      4.5     1735055101
10000095  99999   50       4.0     1735055101
10000096  99999   4226     4.5     1735055101

We validate our new ratings have been added, then train a new model with 3 epochs this time (because I am impatient).

dls = CollabDataLoaders.from_df(
    ratings2,
    user_name='userId',
    item_name='movieId',
    rating_name='rating',
    bs=256,
)

learner2 = collab_learner(
    dls,
    n_factors=20,
    y_range=(0.5, 5.5)
)
learner2.fit_one_cycle(3)
epoch  train_loss  valid_loss  time
0      0.698186    0.734756    02:15
1      0.640888    0.686441    02:16
2      0.620803    0.676162    02:16
get_preds(learner2, ratings2)

       movie_id  title  year  predicted_rating
843    858    Godfather, The (1972)  1972  4.458255
523    527    Schindler's List (1993)  1993  4.407414
732    745    Wallace & Gromit: A Close Shave (1995)  1995  4.385326
660    668    Pather Panchali (1955)  1955  4.381900
708    720    Wallace & Gromit: The Best of Aardman Animation (1996)  1996  4.373108
1125   1148   Wallace & Gromit: The Wrong Trousers (1993)  1993  4.368431
895    912    Casablanca (1942)  1942  4.365451
661    670    World of Apu, The (Apur Sansar) (1959)  1959  4.364862
4879   4973   Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)  2001  4.358409
293    296    Pulp Fiction (1994)  1994  4.358396
5916   6016   City of God (Cidade de Deus) (2002)  2002  4.351150
1169   1193   One Flew Over the Cuckoo's Nest (1975)  1975  4.349298
9468   44555  Lives of Others, The (Das Leben der Anderen) (2006)  2006  4.344989
2937   3022   General, The (1927)  1927  4.336016
1935   2019   Seven Samurai (Shichinin no samurai) (1954)  1954  4.329831
887    904    Rear Window (1954)  1954  4.318365
737    750    Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)  1964  4.311434
2945   3030   Yojimbo (1961)  1961  4.303742
1195   1221   Godfather: Part II, The (1974)  1974  4.294334
1178   1203   12 Angry Men (1957)  1957  4.291220

These recommendations are noticeably different, and not just because the three movies I rated are now filtered out of the predictions. They seem OK, but they skew a bit older than the movies I typically watch. It's hard to explain quantitatively, but I think we can improve on this.
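One way to put a rough number on the "skews older" impression is to compare the median release year of each model's top-20 list. A quick sketch with the years copied from the two recommendation tables above:

```python
from statistics import median

# Release years of the top-20 recommendations from the first model (learner)
# and the retrained model (learner2), copied from the tables above.
learner_years = [1994, 1995, 1994, 1972, 1999, 2000, 1981, 2001, 1999, 2003,
                 1974, 1992, 2002, 1975, 1980, 1977, 1993, 2002, 1990, 1987]
learner2_years = [1972, 1993, 1995, 1955, 1996, 1993, 1942, 1959, 2001, 1994,
                  2002, 1975, 2006, 1927, 1954, 1954, 1964, 1961, 1974, 1957]

print(median(learner_years))   # 1993.5
print(median(learner2_years))  # 1973.0
```

The retrained model's list is about twenty years older at the median, which matches the impression.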

With the training process down reasonably well, let's train a model with more factors to see how the predictions change and, hopefully, improve.

dls = CollabDataLoaders.from_df(
    ratings2,
    user_name='userId',
    item_name='movieId',
    rating_name='rating',
    bs=256,
)

learner3 = collab_learner(
    dls,
    n_factors=100,
    y_range=(0.5, 5.5)
)
learner3.fit_one_cycle(3)
epoch  train_loss  valid_loss  time
0      0.653620    0.705440    03:37
1      0.515555    0.662772    03:44
2      0.494222    0.654260    03:42

This model ended up training faster than I expected. Here are the recommendations:

recommendations = get_preds(learner3, ratings2)
recommendations

       movie_id  title  year  predicted_rating
843    858    Godfather, The (1972)  1972  4.215202
2240   2324   Life Is Beautiful (La Vita è bella) (1997)  1997  4.161647
523    527    Schindler's List (1993)  1993  4.159183
9468   44555  Lives of Others, The (Das Leben der Anderen) (2006)  2006  4.143415
293    296    Pulp Fiction (1994)  1994  4.143309
2245   2329   American History X (1998)  1998  4.124294
5916   6016   City of God (Cidade de Deus) (2002)  2002  4.122414
1171   1196   Star Wars: Episode V - The Empire Strikes Back (1980)  1980  4.106239
257    260    Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)  1977  4.096898
895    912    Casablanca (1942)  1942  4.092230
1149   1172   Cinema Paradiso (Nuovo cinema Paradiso) (1989)  1989  4.082796
1645   1704   Good Will Hunting (1997)  1997  4.076812
290    293    Léon: The Professional (Léon) (Professional, The) (1994)  1994  4.039810
4879   4973   Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)  2001  4.038737
1195   1221   Godfather: Part II, The (1974)  1974  4.038615
2773   2858   American Beauty (1999)  1999  4.035391
2874   2959   Fight Club (1999)  1999  4.030475
1172   1197   Princess Bride, The (1987)  1987  4.022649
6427   6539   Pirates of the Caribbean: The Curse of the Black Pearl (2003)  2003  4.022105
1207   1234   Sting, The (1973)  1973  4.010592

The predictions are pretty similar to the previous models'. It seems I underestimated the effect of bs (batch size): with larger batches there are fewer, smoother gradient updates per epoch, which can make it easier for training to settle into poor minima. The fastai bs default is 64, so let's try training another model with that.
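The batch-size effect on the number of updates is easy to quantify: with fastai's default 80/20 train/validation split, the number of optimizer steps per epoch is roughly len(train) / bs. A quick back-of-the-envelope sketch (the row count comes from the ratings2 tail above):

```python
import math

def steps_per_epoch(n_ratings: int, bs: int, valid_pct: float = 0.2) -> int:
    """Optimizer updates per epoch for a given batch size and validation split."""
    n_train = int(n_ratings * (1 - valid_pct))
    return math.ceil(n_train / bs)

n = 10_000_097  # rows in ratings2
print(steps_per_epoch(n, bs=256))  # 31251 steps per epoch
print(steps_per_epoch(n, bs=64))   # 125002 -- 4x more parameter updates
```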

dls = CollabDataLoaders.from_df(
    ratings2,
    user_name='userId',
    item_name='movieId',
    rating_name='rating',
    bs=64,
)

learner4 = collab_learner(
    dls,
    n_factors=100,
    y_range=(0.5, 5.5)
)
learner4.fit_one_cycle(3)
epoch  train_loss  valid_loss  time
0      0.670424    0.718308    15:11
1      0.552318    0.678102    15:23
2      0.520957    0.663053    15:25

Forty-five minutes later, our model is trained. Let's see how we did.

recommendations = get_preds(learner4, ratings2)
recommendations

       movie_id  title  year  predicted_rating
293    296    Pulp Fiction (1994)  1994  4.540648
2874   2959   Fight Club (1999)  1999  4.533676
843    858    Godfather, The (1972)  1972  4.420893
2487   2571   Matrix, The (1999)  1999  4.418619
1195   1221   Godfather: Part II, The (1974)  1974  4.376376
5916   6016   City of God (Cidade de Deus) (2002)  2002  4.362540
2245   2329   American History X (1998)  1998  4.313848
1935   2019   Seven Samurai (Shichinin no samurai) (1954)  1954  4.291965
587    593    Silence of the Lambs, The (1991)  1991  4.287333
46     47     Seven (a.k.a. Se7en) (1995)  1995  4.286267
7039   7153   Lord of the Rings: The Return of the King, The (2003)  2003  4.276011
1067   1089   Reservoir Dogs (1992)  1992  4.256866
7247   7361   Eternal Sunshine of the Spotless Mind (2004)  2004  4.256151
4899   4993   Lord of the Rings: The Fellowship of the Ring, The (2001)  2001  4.251918
5852   5952   Lord of the Rings: The Two Towers, The (2002)  2002  4.243845
6762   6874   Kill Bill: Vol. 1 (2003)  2003  4.231368
1564   1617   L.A. Confidential (1997)  1997  4.224444
2240   2324   Life Is Beautiful (La Vita è bella) (1997)  1997  4.212670
9127   33794  Batman Begins (2005)  2005  4.211322
523    527    Schindler's List (1993)  1993  4.199523

Qualitatively, I think these predictions are an improvement. The movies are a bit newer than those surfaced by the other models, and more of the list is made up of movies I've already seen or that friends have recommended to me.

Let’s see if we can reduce n_factors and still get what appear to be good results, as this should speed up training.

My goal is to find a balance of prediction quality and speed of training that makes it reasonable to retrain a model whenever I update my ratings list. It’s possible faster training will sacrifice quality too much on my hardware but let’s see.
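For context, here are the approximate per-epoch times from the runs so far, which make the cost of each configuration explicit (seconds are averaged by eye from the training tables above):

```python
# (bs, n_factors) -> approximate seconds per epoch, read from the tables above
runs = {
    (256, 20):  2 * 60 + 20,   # ~02:20 per epoch
    (256, 100): 3 * 60 + 41,   # ~03:41 per epoch
    (64, 100):  15 * 60 + 20,  # ~15:20 per epoch
}
baseline = runs[(256, 20)]
for (bs, n_factors), secs in sorted(runs.items()):
    print(f"bs={bs:<4} n_factors={n_factors:<4} {secs / baseline:.1f}x baseline")
```

Dropping bs to 64 costs far more per epoch than raising n_factors did, so batch size is the main lever on training time here.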

dls = CollabDataLoaders.from_df(
    ratings2,
    user_name='userId',
    item_name='movieId',
    rating_name='rating',
    bs=64,
)

learner5 = collab_learner(
    dls,
    n_factors=50,
    y_range=(0.5, 5.5)
)
learner5.fit_one_cycle(3)
epoch  train_loss  valid_loss  time
0      0.679360    0.726244    10:14
1      0.624619    0.682216    10:10
2      0.580484    0.668213    10:08
recommendations = get_preds(learner5, ratings2)
recommendations

       movie_id  title  year  predicted_rating
7039   7153   Lord of the Rings: The Return of the King, The (2003)  2003  4.511814
5852   5952   Lord of the Rings: The Two Towers, The (2002)  2002  4.485892
108    110    Braveheart (1995)  1995  4.471653
4899   4993   Lord of the Rings: The Fellowship of the Ring, The (2001)  2001  4.465234
257    260    Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)  1977  4.461960
2487   2571   Matrix, The (1999)  1999  4.458735
1171   1196   Star Wars: Episode V - The Empire Strikes Back (1980)  1980  4.448349
1173   1198   Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)  1981  4.401854
843    858    Godfather, The (1972)  1972  4.367479
523    527    Schindler's List (1993)  1993  4.364442
1184   1210   Star Wars: Episode VI - Return of the Jedi (1983)  1983  4.358963
1944   2028   Saving Private Ryan (1998)  1998  4.299438
1195   1221   Godfather: Part II, The (1974)  1974  4.241967
2245   2329   American History X (1998)  1998  4.237298
1113   1136   Monty Python and the Holy Grail (1975)  1975  4.220087
3489   3578   Gladiator (2000)  2000  4.213627
2677   2762   Sixth Sense, The (1999)  1999  4.209116
9127   33794  Batman Begins (2005)  2005  4.207511
2874   2959   Fight Club (1999)  1999  4.203415
587    593    Silence of the Lambs, The (1991)  1991  4.195570

These seem reasonably similar. Not so surprisingly, if you haven't seen Lord of the Rings or Star Wars, the model thinks you should.

Do things change with further training?

learner5.fit_one_cycle(2)
epoch  train_loss  valid_loss  time
0      0.584899    0.686621    10:07
1      0.575225    0.664721    10:07
recommendations = get_preds(learner5, ratings2)
recommendations

       movie_id  title  year  predicted_rating
843    858    Godfather, The (1972)  1972  4.525020
1173   1198   Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)  1981  4.468911
1195   1221   Godfather: Part II, The (1974)  1974  4.464158
108    110    Braveheart (1995)  1995  4.433938
1171   1196   Star Wars: Episode V - The Empire Strikes Back (1980)  1980  4.393318
1944   2028   Saving Private Ryan (1998)  1998  4.367393
7039   7153   Lord of the Rings: The Return of the King, The (2003)  2003  4.325635
5852   5952   Lord of the Rings: The Two Towers, The (2002)  2002  4.321746
4899   4993   Lord of the Rings: The Fellowship of the Ring, The (2001)  2001  4.321311
2487   2571   Matrix, The (1999)  1999  4.315341
523    527    Schindler's List (1993)  1993  4.307681
257    260    Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)  1977  4.301846
1187   1213   Goodfellas (1990)  1990  4.281674
352    356    Forrest Gump (1994)  1994  4.267944
293    296    Pulp Fiction (1994)  1994  4.262625
5916   6016   City of God (Cidade de Deus) (2002)  2002  4.260427
9127   33794  Batman Begins (2005)  2005  4.240374
1263   1291   Indiana Jones and the Last Crusade (1989)  1989  4.234428
1935   2019   Seven Samurai (Shichinin no samurai) (1954)  1954  4.223034
1184   1210   Star Wars: Episode VI - Return of the Jedi (1983)  1983  4.215607

Not too different.

Let’s add some more filtering capabilities to our prediction generator.

def get_preds(learner, ratings, user_id=99999, num_recs=20, exclude_terms=None):
    all_movies = pd.DataFrame({'userId': [user_id] * len(movies), 'movieId': movies['id']})
    preds = learner.get_preds(dl=learner.dls.test_dl(all_movies))[0].numpy()

    recommendations = pd.DataFrame({
        'movie_id': movies['id'],
        'title': movies['name'],
        'year': movies['year'],
        'predicted_rating': preds
    })

    rated_movies = ratings[ratings['userId'] == user_id]['movieId'].values
    recommendations = recommendations[~recommendations['movie_id'].isin(rated_movies)]

    if exclude_terms:
        for term in exclude_terms:
            # regex=False treats terms as literal strings, so titles with regex metacharacters are safe
            recommendations = recommendations[~recommendations['title'].str.contains(term, case=False, regex=False)]

    return recommendations.sort_values('predicted_rating', ascending=False).head(num_recs)
recommendations = get_preds(learner5, ratings2, exclude_terms=['Star Wars', 'Lord of the Rings'])
recommendations

       movie_id  title  year  predicted_rating
843    858    Godfather, The (1972)  1972  4.525020
1173   1198   Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)  1981  4.468911
1195   1221   Godfather: Part II, The (1974)  1974  4.464158
108    110    Braveheart (1995)  1995  4.433938
1944   2028   Saving Private Ryan (1998)  1998  4.367393
2487   2571   Matrix, The (1999)  1999  4.315341
523    527    Schindler's List (1993)  1993  4.307681
1187   1213   Goodfellas (1990)  1990  4.281674
352    356    Forrest Gump (1994)  1994  4.267944
293    296    Pulp Fiction (1994)  1994  4.262625
5916   6016   City of God (Cidade de Deus) (2002)  2002  4.260427
9127   33794  Batman Begins (2005)  2005  4.240374
1263   1291   Indiana Jones and the Last Crusade (1989)  1989  4.234428
1935   2019   Seven Samurai (Shichinin no samurai) (1954)  1954  4.223034
895    912    Casablanca (1942)  1942  4.207191
3489   3578   Gladiator (2000)  2000  4.201711
1113   1136   Monty Python and the Holy Grail (1975)  1975  4.175015
2245   2329   American History X (1998)  1998  4.164785
9468   44555  Lives of Others, The (Das Leben der Anderen) (2006)  2006  4.145649
2418   2502   Office Space (1999)  1999  4.134266

Nice. Lots of movies I've heard of, or have seen and liked but haven't added ratings for.

It turns out MovieLens is a movie recommendation service (they’ve kindly provided their data to learn from). Given my success here, I will probably try out the service.

Wrapping up#

This exploration was great for getting a feel for how hyperparameter tuning can affect model predictions, and for the time constraints you run into when training many slightly varied models. Specifically, I now have a deeper appreciation for the challenges of continuously retraining a model as new data arrives. Since training isn't always fast, keeping a model up to date isn't as trivial as, say, writing the new data to a database.

It was fun to generate personalized predictions and I actually ended up watching The Usual Suspects after getting the recommendation from the model, which I enjoyed.