The following code allowed me to successfully download the IMDB dataset with fastai to a Modal volume:
import os

import modal

os.environ["FASTAI_HOME"] = "/data/fastai"  # store fastai data on the mounted volume

from fastai.text.all import *

app = modal.App("imdb-dataset-train")
vol = modal.Volume.from_name("modal-llm-data", create_if_missing=True)

@app.function(
    gpu="any",
    image=modal.Image.debian_slim().pip_install("fastai"),
    volumes={"/data": vol},
)
def download():
    path = untar_data(URLs.IMDB)  # downloads and extracts under $FASTAI_HOME/data
    print(f"Data downloaded to: {path}")
    return path
Run with:
modal run train.py::download
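As a quick sanity check (a helper I added; check is not part of the original script), you can list what landed on the volume. With FASTAI_HOME set to /data/fastai, untar_data should extract to /data/fastai/data/imdb:

@app.function(volumes={"/data": vol})
def check():
    import os
    # print one line per directory with a file count, rather than
    # listing all ~100k review files individually
    for root, dirs, files in os.walk("/data/fastai/data"):
        print(f"{root}: {len(files)} files")

Run it the same way: modal run train.py::check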
Next, I tried to run one epoch of training for the language model:
@app.function(
    gpu="h100",
    image=modal.Image.debian_slim().pip_install("fastai"),
    volumes={"/data": vol},
    timeout=20 * 60,
)
def train():
    path = untar_data(URLs.IMDB)  # already on the volume, so this is a no-op
    print(f"Training with data from: {path}")
    get_imdb = partial(get_text_files, folders=["train", "test", "unsup"])
    dls_lm = DataBlock(
        blocks=TextBlock.from_folder(path, is_lm=True),
        get_items=get_imdb,
        splitter=RandomSplitter(0.1),
    ).dataloaders(path, path=path, bs=128, seq_len=80)
    print("Sample from datablock:")
    dls_lm.show_batch(max_n=2)  # show_batch displays its own output; wrapping it in print() just prints None
    learn = language_model_learner(
        dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]
    ).to_fp16()
    learn.fit_one_cycle(1, 2e-2)
    learn.save("1epoch")
I waited around for a long time without ever seeing “Sample from datablock:” print.
Looking into the volume with the Modal UI, I noticed the /fastai/data/imdb_tok/unsup
folder had been modified recently.
It seemed like the tokenization of the dataset was taking a long time.
I was able to do this tokenization quite quickly locally, so I am going to chalk this up to the Modal volume not being as performant as a local file system.
While I’m not 100% sure, I think the need to train against so many little files may undermine my ability to train this model on Modal.
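If the per-file overhead is the culprit, one untested idea would be to copy the dataset from the volume onto the container's local disk first, so tokenization reads from local SSD instead of network storage. A sketch under that assumption (train_local_copy and the /tmp/imdb path are my own, hypothetical):

import shutil

@app.function(
    gpu="h100",
    image=modal.Image.debian_slim().pip_install("fastai"),
    volumes={"/data": vol},
    timeout=20 * 60,
)
def train_local_copy():
    src = Path("/data/fastai/data/imdb")  # where untar_data put the dataset
    dst = Path("/tmp/imdb")               # container-local scratch disk
    if not dst.exists():
        shutil.copytree(src, dst)  # one upfront copy instead of many small reads during tokenization
    get_imdb = partial(get_text_files, folders=["train", "test", "unsup"])
    dls_lm = DataBlock(
        blocks=TextBlock.from_folder(dst, is_lm=True),
        get_items=get_imdb,
        splitter=RandomSplitter(0.1),
    ).dataloaders(dst, path=dst, bs=128, seq_len=80)
    learn = language_model_learner(
        dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]
    ).to_fp16()
    learn.fit_one_cycle(1, 2e-2)
    learn.save("1epoch")  # saved under /tmp/imdb/models, which disappears with the container
    shutil.copy(dst / "models" / "1epoch.pth", "/data/fastai/1epoch.pth")
    vol.commit()  # persist the checkpoint back to the volume

Even better might be to tar the dataset into a single archive on the volume once and extract it locally at the start of training, which would avoid the per-file overhead of the copy itself.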