The following code allowed me to successfully download the IMDB dataset with fastai to a Modal volume:

import os

os.environ["FASTAI_HOME"] = "/data/fastai"  # make fastai store its data on the mounted volume

import modal
from fastai.text.all import *

app = modal.App("imdb-dataset-train")
vol = modal.Volume.from_name("modal-llm-data", create_if_missing=True)


@app.function(
    gpu="any",
    image=modal.Image.debian_slim().pip_install("fastai"),
    volumes={"/data": vol},
)
def download():
    path = untar_data(URLs.IMDB)
    print(f"Data downloaded to: {path}")
    return path

Run it with:

modal run train.py::download
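
To sanity-check that the files actually landed on the volume rather than on the container's ephemeral disk, you can list the volume from the Modal CLI. The path below assumes the FASTAI_HOME=/data/fastai layout from the snippet above:

modal volume ls modal-llm-data fastai/data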

Next, I tried to run one epoch of training for the language model:

@app.function(
    gpu="h100",
    image=modal.Image.debian_slim().pip_install("fastai"),
    volumes={"/data": vol},
    timeout=20 * 60,
)
def train():
    path = untar_data(URLs.IMDB)
    print(f"Training with data from: {path}")
    # Language-model DataBlock over the train, test, and unsup folders
    get_imdb = partial(get_text_files, folders=["train", "test", "unsup"])
    dls_lm = DataBlock(
        blocks=TextBlock.from_folder(path, is_lm=True),
        get_items=get_imdb,
        splitter=RandomSplitter(0.1),
    ).dataloaders(path, path=path, bs=128, seq_len=80)
    print("Sample from datablock:")
    print(dls_lm.show_batch(max_n=2))
    learn = language_model_learner(
        dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]
    ).to_fp16()
    learn.fit_one_cycle(1, 2e-2)
    learn.save("1epoch")
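
This runs the same way as before, pointing modal run at the train function:

modal run train.py::train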

I waited around for a long time without ever seeing "Sample from datablock:" print. Looking into the volume through the Modal UI, I noticed the /fastai/data/imdb_tok/unsup folder had been modified recently, so it seemed like tokenization of the dataset was taking a long time. I was able to do the same tokenization quite quickly locally, so I'm chalking this up to the Modal volume not being as performant as a local filesystem. While I'm not 100% sure, I suspect that needing to read and write so many small files may undermine my ability to train this model on Modal.
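
If I wanted to confirm that the bottleneck really is per-file latency on the volume rather than the tokenization itself, a rough probe would be to time reads of a sample of the small IMDB text files from inside a Modal function and compare against the same loop run locally. This is just a sketch: it reuses the app, vol, and image defined above, and assumes the data sits under /data/fastai/data/imdb as created by untar_data:

import time
from itertools import islice
from pathlib import Path


@app.function(
    image=modal.Image.debian_slim().pip_install("fastai"),
    volumes={"/data": vol},
)
def time_reads(n_files: int = 2000):
    # Read a sample of the small per-review .txt files from the volume
    # and report throughput, to compare against a local filesystem.
    root = Path("/data/fastai/data/imdb")
    files = list(islice(root.rglob("*.txt"), n_files))
    start = time.time()
    total_bytes = sum(len(p.read_bytes()) for p in files)
    elapsed = time.time() - start
    print(f"Read {len(files)} files ({total_bytes / 1e6:.1f} MB) in {elapsed:.1f}s "
          f"({len(files) / elapsed:.0f} files/s)")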