In this notebook/post, we’re going to use the markdown content from my blog to train a language model. We’ll then prompt the model to generate a post on a topic I might write about.

Let’s import fastai and disable warnings since these pollute the notebook a lot when I’m trying to convert these notebooks into posts (I am writing this as a notebook and converting it to a markdown file with this script).

from fastai.text.all import *
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

## Loading the data

The written content on my blog is stored as markdown (.md) files. You can see the raw contents of any post on this site by appending /index.md to the end of its URL. They look something like

---
<frontmatter key values>
---

<the rest of the post with code, links, images, shortcodes, etc.>
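
For a sense of what the frontmatter holds, here is a hypothetical example (the keys and values are illustrative, not from a real post):

---
title: "An example post"
date: 2024-01-01
tags:
  - language models
---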

To get started, I modeled my approach after the one used in chapter 10 of the fastai book, which looks something like

get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

There were a few modifications I needed to make to this approach. For starters, we were loading .md files rather than plain text (.txt) files, so initially I tried to do this with

path = Path("./data/content")
files = get_files(path, extensions='.md', recurse=True)
for f in files[:3]:
    print(f)
data/content/posts/2013/2013-07-05-qc.md
data/content/posts/2024/models-writing-about-coding-with-models.md
data/content/posts/2024/vlms-hallucinate.md

However, these .md files seemed to cause opaque and confusing issues with the DataBlock or DataLoaders that manifested something like this

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[209], line 10
      3 # First, create a tokenizer
      4 text_processor = TextBlock.from_folder(path, is_lm=True)
      6 dls_lm = DataBlock(
      7     blocks=text_processor,
      8     get_items=get,
      9     splitter=RandomSplitter(0.1)
---> 10 ).dataloaders(
     11     path,
     12     path=path,
     13     bs=128,
     14     seq_len=80,
     15 )

File ~/dev/lab/fastbook_projects/blog_post_generator/.venv/lib/python3.12/site-packages/fastai/data/block.py:157, in DataBlock.dataloaders(self, source, path, verbose, **kwargs)
    151 def dataloaders(self,
    152     source, # The data source
    153     path:str='.', # Data source and default `Learner` path
    154     verbose:bool=False, # Show verbose messages
    155     **kwargs
    156 ) -> DataLoaders:
--> 157     dsets = self.datasets(source, verbose=verbose)
    158     kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
...
    387     self.types.append(type(x))
--> 388 types = L(t if is_listy(t) else [t] for t in self.types).concat().unique()
    389 self.pretty_types = '\n'.join([f'  - {t}' for t in types])

TypeError: 'NoneType' object is not iterable

To work around this challenge, I changed all the file extensions to .txt. This allowed the data to be loaded and tokenized.

Next, I had an issue with encoding

    return f.read()
           ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

I solved this in the copy_and_rename_files function with

src_file.read_text(encoding=encoding)

There is also a function called clean_content, which I will address later.

import re


def clean_content(text):
    # Replace fenced code blocks (```lang ... ```) with <CODE> tags, dropping the language specifier
    text = re.sub(r'```(\w+)?\n(.*?)```', lambda m: f'<CODE>{m.group(2)}</CODE>', text, flags=re.DOTALL)

    # Replace inline code in single backticks with <INLINE_CODE> tags
    text = re.sub(r'`([^`]+)`', r'<INLINE_CODE>\1</INLINE_CODE>', text)

    return text

def copy_and_rename_files():
    src_dir = Path("./data/content")
    dst_dir = Path("./data/cleaned_content")

    if not dst_dir.exists():
        dst_dir.mkdir(parents=True)

    for src_file in src_dir.rglob("*.md"):
        try:
            content = None
            for encoding in ['utf-8', 'latin-1', 'cp1252']:
                try:
                    content = src_file.read_text(encoding=encoding)
                    if content.startswith('---'):
                        # Remove markdown frontmatter
                        parts = content.split('---', 2)
                        if len(parts) >= 3:
                            content = parts[2]
                    break
                except UnicodeDecodeError:
                    continue

            if content is None:
                print(f"Skipping {src_file}: Unable to decode with supported encodings")
                continue

            content = clean_content(content)

            rel_path = src_file.relative_to(src_dir)
            dst_file = dst_dir / rel_path.with_suffix('.txt')
            dst_file.parent.mkdir(parents=True, exist_ok=True)
            dst_file.write_text(content, encoding='utf-8')

        except Exception as e:
            print(f"Error processing {src_file}: {str(e)}")

copy_and_rename_files()
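
To make the cleaning step concrete, here’s a quick check of what clean_content produces on a small, made-up snippet (both the input string and the output below are illustrative, not from an actual post):

# Illustrative input containing inline code and a fenced code block
sample = "Some `inline code` and a fenced block:\n```python\nprint('hi')\n```"
print(clean_content(sample))
Some <INLINE_CODE>inline code</INLINE_CODE> and a fenced block:
<CODE>print('hi')
</CODE>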

## Training the model

With the needed adjustments to the dataset made, and the model-ready content now in the data/cleaned_content folder, we can load and tokenize that data with fastai.

We validate that we can read the file paths with our code

path = Path("./data/cleaned_content")
files = get_text_files(path)
for f in files[:3]:
    print(f)
data/cleaned_content/posts/2013/2013-07-05-qc.txt
data/cleaned_content/posts/2024/language-model-based-aggregators.txt
data/cleaned_content/posts/2024/making-your-vision-real.txt

Then we create a DataBlock and view the loaded, tokenized content

get = partial(get_text_files, folders=['posts', 'til', 'logs', 'projects'])

text_processor = TextBlock.from_folder(path, is_lm=True)

dls_lm = DataBlock(
    blocks=text_processor,
    get_items=get,
    splitter=RandomSplitter(0.1)
).dataloaders(
    path,
    path=path,
    bs=128,
    seq_len=512,
)
dls_lm.show_batch(max_n=2)
(show_batch output; columns: text and text_)
0xxbos xxmaj i 've been wanting to create a chat component for this site for a while , because i really do n't like quoting conversations and manually formatting them each time . \n xxmaj when using a model playground , usually there is a code snippet option that generates xxmaj python code you can copy out intro a script . \n xxmaj using that feature , i can now copy the message list and paste it as xxup json into a xxmaj hugo shortcode and get results like this : \n\n\n { { < chat model="gpt-4o - mini " > } } \n [ \nā– { \nā– " role " : " system " , \nā– " content " : [ \nā– { \nā– " type " : " text " , \nā– " text " : " you should respond with the understanding they are an experienced softwarexxmaj i 've been wanting to create a chat component for this site for a while , because i really do n't like quoting conversations and manually formatting them each time . \n xxmaj when using a model playground , usually there is a code snippet option that generates xxmaj python code you can copy out intro a script . \n xxmaj using that feature , i can now copy the message list and paste it as xxup json into a xxmaj hugo shortcode and get results like this : \n\n\n { { < chat model="gpt-4o - mini " > } } \n [ \nā– { \nā– " role " : " system " , \nā– " content " : [ \nā– { \nā– " type " : " text " , \nā– " text " : " you should respond with the understanding they are an experienced software engineer
1xxunk / aka \n [ sublime xxup cli ] : https : / / xxrep 3 w xxunk / docs / 2 / xxunk xxbos i built a [ site](https : / / xxunk / ) to host a language model generated kid ā€™s book i built using chatgpt and xxmaj midjourney . xxmaj the plot was sourced from my book xxunk to see how a language model would perform writing a xxunk with an unusual plot . \n xxmaj it prompted some interesting conversations about the role of xxunk and art in culture on the [ xxunk xxunk : / / news.ycombinator.com / xxunk ) . \n\n xxmaj tech : xxmaj react , xxmaj next.js , openai , xxmaj midjourney , xxmaj vercel \n\n [ source code](https : / / github.com / danielcorin / adventure - of - xxunk / ) xxbos i wrote a tiny site to use/ aka \n [ sublime xxup cli ] : https : / / xxrep 3 w xxunk / docs / 2 / xxunk xxbos i built a [ site](https : / / xxunk / ) to host a language model generated kid ā€™s book i built using chatgpt and xxmaj midjourney . xxmaj the plot was sourced from my book xxunk to see how a language model would perform writing a xxunk with an unusual plot . \n xxmaj it prompted some interesting conversations about the role of xxunk and art in culture on the [ xxunk xxunk : / / news.ycombinator.com / xxunk ) . \n\n xxmaj tech : xxmaj react , xxmaj next.js , openai , xxmaj midjourney , xxmaj vercel \n\n [ source code](https : / / github.com / danielcorin / adventure - of - xxunk / ) xxbos i wrote a tiny site to use to

That looks good, so we create a learner and then follow the same approach as Chapter 10, checkpointing as we go. We don’t have to do it this way (we could just call fit_one_cycle as many times as we want) but it was helpful for me to validate the process end-to-end once more.

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()
learn.fit_one_cycle(1, 2e-2)

| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|-------|------------|------------|----------|------------|-------|
| 0 | 4.923746 | 4.835233 | 0.222222 | 125.867935 | 00:12 |

learn.save('1epoch')
Path('data/cleaned_content/models/1epoch.pth')
learn = learn.load('1epoch')

Now, we do the bulk of the fine-tuning.

learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|-------|------------|------------|----------|------------|-------|
| 0 | 3.721324 | 4.359152 | 0.278559 | 78.190788 | 00:12 |
| 1 | 3.582399 | 3.972712 | 0.338368 | 53.128395 | 00:13 |
| 2 | 3.452389 | 3.608671 | 0.379557 | 36.916973 | 00:13 |
| 3 | 3.307291 | 3.372669 | 0.413889 | 29.156248 | 00:13 |
| 4 | 3.221604 | 3.291163 | 0.422917 | 26.874105 | 00:13 |
| 5 | 3.131132 | 3.213343 | 0.436892 | 24.862070 | 00:13 |
| 6 | 3.047300 | 3.147552 | 0.447179 | 23.278996 | 00:13 |
| 7 | 2.978492 | 3.120242 | 0.455946 | 22.651857 | 00:14 |
| 8 | 2.921093 | 3.109544 | 0.458030 | 22.410818 | 00:14 |
| 9 | 2.876113 | 3.107162 | 0.458464 | 22.357492 | 00:12 |

learn.save_encoder('finetuned')
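
As mentioned earlier, the intermediate save and load aren’t strictly required. A sketch of the more direct version (reusing the same dls_lm from above) would look like this:

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()
# Train the new head for one epoch, then unfreeze and fine-tune the full model
learn.fit_one_cycle(1, 2e-2)
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
learn.save_encoder('finetuned')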

## Experimenting with the Result

With the fine-tuned model, we can run inference! We define the beginning of the content we want the model to generate in TEXT, then call learn.predict with the number of words we want and set temperature to control the randomness/creativity of the output.

TEXT = "I am interacting with a language model as a thought partner to"
N_WORDS = 256
TEMP = 0.75
pred = learn.predict(TEXT, N_WORDS, temperature=TEMP)
print(pred)
i am interacting with a language model as a thought partner to train . 
 I think it should be more difficult to use Sonnet but it is often difficult to get to because i am a very familiar language model . 

 In this way , i wanted to use the phrase Sonnet > rather than just see the phrase as a word . 
 He 's also using the word " prompt " , which is a word word that is used in art , then as a title to describe a language model that extract structured data from an individual 's experience . 
 I 've used this idea to describe how a language model has a structure with a single structure and i can try and solve this with the following following Sonnet code : i n't have such a thead for a language model . 
 Being like this is an easy way to design an [ initial Sonnet ] . 
 In another example , where you would have written a single word to explicitly describe a language model , i had to use the following word usage : < inline_code > 
 < / INLINE_CODE > : The word i used to read the script in a language model and then describe the word as an JSON object . 

 This working was quite different from my JSON code and Python CODE > 

 the most recent model to use this PYTHON code

The output is a little wild, but it kinda sorta makes sense and isn’t too bad for a model I trained in a couple of minutes on my MacBook.

I tweaked several parts of the approach in an effort to improve the model’s output quality. Jumping back to the clean_content function, I found that removing the markdown frontmatter and replacing triple backticks with single tokens seemed to make the output a bit more coherent. When this fine-tuned model tries to generate code, the result makes little sense and includes strange artifacts like repeated tokens (e.g. importimportimport). I have a feeling this deficiency may be because the base model wasn’t trained on much source code.

So there we have it. A simple language model fine-tuned on my blog posts. This was a helpful experience for getting a feel for some feature engineering.

If you liked this post, be sure to check out some of the other notebooks I’ve built while working through the FastAI Course, linked below.