The following code allowed me to successfully download the IMDB dataset with fastai to a Modal volume:

import os

import modal

# Set FASTAI_HOME before importing fastai so untar_data writes to the
# volume mount at /data instead of the default ~/.fastai.
os.environ["FASTAI_HOME"] = "/data/fastai"

from fastai.text.all import *

app = modal.App("imdb-dataset-train")
vol = modal.Volume.from_name("modal-llm-data", create_if_missing=True)


@app.function(
    gpu="any",
    image=modal.Image.debian_slim().pip_install("fastai"),
    volumes={"/data": vol},
)
def download():
    path = untar_data(URLs.IMDB)
    print(f"Data downloaded to: {path}")
    return path

Run it with:

modal run train.py::download

Next, I tried to run one epoch of language-model training:

@app.function(
    gpu="h100",
    image=modal.Image.debian_slim().pip_install("fastai"),
    volumes={"/data": vol},
    timeout=20 * 60,
)
def train():
    path = untar_data(URLs.IMDB)
    print(f"Training with data from: {path}")
    get_imdb = partial(get_text_files, folders=["train", "test", "unsup"])

    # Building the DataLoaders tokenizes the entire corpus the first time,
    # writing the results to an imdb_tok/ folder alongside the dataset.
    dls_lm = DataBlock(
        blocks=TextBlock.from_folder(path, is_lm=True),
        get_items=get_imdb,
        splitter=RandomSplitter(0.1),
    ).dataloaders(path, path=path, bs=128, seq_len=80)

    print("Sample from datablock:")
    dls_lm.show_batch(max_n=2)  # show_batch displays directly and returns None

    learn = language_model_learner(
        dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]
    ).to_fp16()

    learn.fit_one_cycle(1, 2e-2)
    learn.save("1epoch")

I waited around for a long time and never saw “Sample from datablock:” print. Looking into the volume with the Modal UI, I noticed the /fastai/data/imdb_tok/unsup folder had been modified recently, so the tokenization of the dataset appeared to be taking a very long time. The same tokenization runs quite fast locally, so I am going to chalk this up to the Modal volume not being as performant as a local file system. While I’m not 100% sure, I think the need to read and write so many small files may undermine my ability to train this model on Modal.
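If that hypothesis is right, one workaround would be to treat tokenization as a one-time batch job: copy the raw dataset from the volume onto the container’s local disk, tokenize there, and copy the finished imdb_tok folder back to the volume in a single pass. The sketch below is unverified; the tokenize_locally name and the /tmp staging path are my own choices, it assumes the container’s ephemeral disk can hold the corpus, and it relies on the imports at the top of the script (tokenize_folder itself is fastai’s, the same routine the DataBlock invokes implicitly).

import shutil

@app.function(
    # No GPU needed: this step is pure file I/O and CPU tokenization.
    image=modal.Image.debian_slim().pip_install("fastai"),
    volumes={"/data": vol},
    timeout=60 * 60,
)
def tokenize_locally():
    # Stage the raw dataset on container-local disk, which should behave
    # much more like a local file system than the network-backed volume.
    src = Path("/data/fastai/data/imdb")
    local = Path("/tmp/imdb")
    shutil.copytree(src, local)

    # Tokenize on local disk; fastai writes the results to /tmp/imdb_tok,
    # a sibling of /tmp/imdb, along with its counter/lengths pickles.
    tokenize_folder(local, folders=["train", "test", "unsup"])

    # Copy the tokenized corpus back to the volume in one pass.
    # dirs_exist_ok merges over any partial imdb_tok left by earlier runs.
    shutil.copytree(
        Path("/tmp/imdb_tok"),
        Path("/data/fastai/data/imdb_tok"),
        dirs_exist_ok=True,
    )

I would run it once with modal run train.py::tokenize_locally. As far as I can tell, tokenize_folder skips work when the imdb_tok output folder already exists (skip_if_exists defaults to True), so a subsequent train run should go straight to batching instead of re-tokenizing on the volume.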