In this notebook/post, we’re going to use the markdown content from my blog to train a language model. From there, we’ll attempt to prompt the model to generate a post on a topic I might write about.
Let’s import fastai and disable warnings, since warnings pollute the notebook when I’m converting these notebooks into posts (I am writing this as a notebook and converting it to a markdown file with this script).
from fastai.text.all import *
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')
Loading the data#
The written content in my blog is stored in markdown (.md) files. You can see the raw contents of any post on this site by appending /index.md to the end of its URL. They look something like
---
<frontmatter key values>
---
<the rest of the post with code, links, images, shortcodes, etc.>
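For instance, a post’s frontmatter might look something like this (the keys here are illustrative, not copied from a real post):

---
title: "A Post Title"
date: 2024-01-15
tags:
  - language models
  - fastai
---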
To get started, I modeled my approach after the one used in Chapter 10 of the fastai book, which looks something like
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
There were a few modifications I needed to make to this approach. For starters, we were loading .md files rather than .txt files, so initially I tried to do this with
path = Path("./data/content")
files = get_files(path, extensions='.md', recurse=True)
for f in files[:3]:
    print(f)
data/content/posts/2013/2013-07-05-qc.md
data/content/posts/2024/models-writing-about-coding-with-models.md
data/content/posts/2024/vlms-hallucinate.md
However, these seemed to cause opaque and confusing issues with DataBlock or DataLoaders that manifested something like this
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[209], line 10
3 # First, create a tokenizer
4 text_processor = TextBlock.from_folder(path, is_lm=True)
6 dls_lm = DataBlock(
7 blocks=text_processor,
8 get_items=get,
9 splitter=RandomSplitter(0.1)
---> 10 ).dataloaders(
11 path,
12 path=path,
13 bs=128,
14 seq_len=80,
15 )
File ~/dev/lab/fastbook_projects/blog_post_generator/.venv/lib/python3.12/site-packages/fastai/data/block.py:157, in DataBlock.dataloaders(self, source, path, verbose, **kwargs)
151 def dataloaders(self,
152 source, # The data source
153 path:str='.', # Data source and default `Learner` path
154 verbose:bool=False, # Show verbose messages
155 **kwargs
156 ) -> DataLoaders:
--> 157 dsets = self.datasets(source, verbose=verbose)
158 kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
...
387 self.types.append(type(x))
--> 388 types = L(t if is_listy(t) else [t] for t in self.types).concat().unique()
389 self.pretty_types = '\n'.join([f' - {t}' for t in types])
TypeError: 'NoneType' object is not iterable
To work around this challenge, I changed all the file extensions to .txt, which allowed fastai to load and tokenize the dataset.
Next, I had an issue with encoding
return f.read()
^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
I solved this in the copy_and_rename_files function with
src_file.read_text(encoding=encoding)
There is also a function called clean_content, which I will address later.
import re

def clean_content(text):
    # Handle code blocks with language specifiers
    text = re.sub(r'```(\w+)?\n(.*?)```', lambda m: f'<CODE>{m.group(2)}</CODE>', text, flags=re.DOTALL)
    # Replace single backticks
    text = re.sub(r'`([^`]+)`', r'<INLINE_CODE>\1</INLINE_CODE>', text)
    return text
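As a quick sanity check, here’s what clean_content does to a made-up snippet (the input string is illustrative, not from a real post):

sample = "Install with `pip install fastai` then run:\n```python\nprint('hello')\n```\n"
print(clean_content(sample))

Install with <INLINE_CODE>pip install fastai</INLINE_CODE> then run:
<CODE>print('hello')
</CODE>

Note that the order of the substitutions matters: code blocks are replaced first so the single-backtick pattern can’t mangle the triple backticks.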
def copy_and_rename_files():
    src_dir = Path("./data/content")
    dst_dir = Path("./data/cleaned_content")
    if not dst_dir.exists():
        dst_dir.mkdir(parents=True)
    for src_file in src_dir.rglob("*.md"):
        try:
            content = None
            for encoding in ['utf-8', 'latin-1', 'cp1252']:
                try:
                    content = src_file.read_text(encoding=encoding)
                    if content.startswith('---'):
                        # Remove markdown frontmatter
                        parts = content.split('---', 2)
                        if len(parts) >= 3:
                            content = parts[2]
                    break
                except UnicodeDecodeError:
                    continue
            if content is None:
                print(f"Skipping {src_file}: Unable to decode with supported encodings")
                continue
            content = clean_content(content)
            rel_path = src_file.relative_to(src_dir)
            dst_file = dst_dir / rel_path.with_suffix('.txt')
            dst_file.parent.mkdir(parents=True, exist_ok=True)
            dst_file.write_text(content, encoding='utf-8')
        except Exception as e:
            print(f"Error processing {src_file}: {str(e)}")

copy_and_rename_files()
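As a quick spot check (this snippet is my own addition, not part of the original pipeline), we can confirm the converted files landed in the destination folder:

txt_files = list(Path("./data/cleaned_content").rglob("*.txt"))
print(f"{len(txt_files)} files converted to .txt")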
Training the model#
With the needed adjustments to the dataset made, and the model-ready content now in the data/cleaned_content folder, we can load and tokenize that data with fastai. First, we validate that we can read the file paths with our code
path = Path("./data/cleaned_content")
files = get_text_files(path)
for f in files[:3]:
    print(f)
data/cleaned_content/posts/2013/2013-07-05-qc.txt
data/cleaned_content/posts/2024/language-model-based-aggregators.txt
data/cleaned_content/posts/2024/making-your-vision-real.txt
Then we create a DataBlock and view the loaded, tokenized content
get = partial(get_text_files, folders=['posts', 'til', 'logs', 'projects'])
text_processor = TextBlock.from_folder(path, is_lm=True)

dls_lm = DataBlock(
    blocks=text_processor,
    get_items=get,
    splitter=RandomSplitter(0.1)
).dataloaders(
    path,
    path=path,
    bs=128,
    seq_len=512,
)
dls_lm.show_batch(max_n=2)
| | text | text_ |
---|---|---|
0 | xxbos xxmaj i 've been wanting to create a chat component for this site for a while , because i really do n't like quoting conversations and manually formatting them each time . \n xxmaj when using a model playground , usually there is a code snippet option that generates xxmaj python code you can copy out intro a script . \n xxmaj using that feature , i can now copy the message list and paste it as xxup json into a xxmaj hugo shortcode and get results like this : \n\n\n { { < chat model="gpt-4o - mini " > } } \n [ \nā { \nā " role " : " system " , \nā " content " : [ \nā { \nā " type " : " text " , \nā " text " : " you should respond with the understanding they are an experienced software | xxmaj i 've been wanting to create a chat component for this site for a while , because i really do n't like quoting conversations and manually formatting them each time . \n xxmaj when using a model playground , usually there is a code snippet option that generates xxmaj python code you can copy out intro a script . \n xxmaj using that feature , i can now copy the message list and paste it as xxup json into a xxmaj hugo shortcode and get results like this : \n\n\n { { < chat model="gpt-4o - mini " > } } \n [ \nā { \nā " role " : " system " , \nā " content " : [ \nā { \nā " type " : " text " , \nā " text " : " you should respond with the understanding they are an experienced software engineer |
1 | xxunk / aka \n [ sublime xxup cli ] : https : / / xxrep 3 w xxunk / docs / 2 / xxunk xxbos i built a [ site](https : / / xxunk / ) to host a language model generated kid ās book i built using chatgpt and xxmaj midjourney . xxmaj the plot was sourced from my book xxunk to see how a language model would perform writing a xxunk with an unusual plot . \n xxmaj it prompted some interesting conversations about the role of xxunk and art in culture on the [ xxunk xxunk : / / news.ycombinator.com / xxunk ) . \n\n xxmaj tech : xxmaj react , xxmaj next.js , openai , xxmaj midjourney , xxmaj vercel \n\n [ source code](https : / / github.com / danielcorin / adventure - of - xxunk / ) xxbos i wrote a tiny site to use | / aka \n [ sublime xxup cli ] : https : / / xxrep 3 w xxunk / docs / 2 / xxunk xxbos i built a [ site](https : / / xxunk / ) to host a language model generated kid ās book i built using chatgpt and xxmaj midjourney . xxmaj the plot was sourced from my book xxunk to see how a language model would perform writing a xxunk with an unusual plot . \n xxmaj it prompted some interesting conversations about the role of xxunk and art in culture on the [ xxunk xxunk : / / news.ycombinator.com / xxunk ) . \n\n xxmaj tech : xxmaj react , xxmaj next.js , openai , xxmaj midjourney , xxmaj vercel \n\n [ source code](https : / / github.com / danielcorin / adventure - of - xxunk / ) xxbos i wrote a tiny site to use to |
That looks good (the xxbos, xxmaj, and xxunk you see are fastai’s special tokens, marking the start of a text, a capitalized word, and an out-of-vocabulary word, respectively), so we create a learner and run the same approach as in Chapter 10, checkpointing as we go. We don’t have to do it this way (we could just call fit_one_cycle as many times as we want), but it was helpful for me to validate the process end-to-end once more.
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()
learn.fit_one_cycle(1, 2e-2)
epoch | train_loss | valid_loss | accuracy | perplexity | time |
---|---|---|---|---|---|
0 | 4.923746 | 4.835233 | 0.222222 | 125.867935 | 00:12 |
learn.save('1epoch')
Path('data/cleaned_content/models/1epoch.pth')
learn = learn.load('1epoch')
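As an aside, instead of reusing the learning rates from Chapter 10, we could pick them empirically with fastai’s learning rate finder (an optional step; I stuck with the chapter’s values below):

# Sweep learning rates and plot loss to suggest a reasonable value
learn.lr_find()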
Now, we do the bulk of the fine-tuning.
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
epoch | train_loss | valid_loss | accuracy | perplexity | time |
---|---|---|---|---|---|
0 | 3.721324 | 4.359152 | 0.278559 | 78.190788 | 00:12 |
1 | 3.582399 | 3.972712 | 0.338368 | 53.128395 | 00:13 |
2 | 3.452389 | 3.608671 | 0.379557 | 36.916973 | 00:13 |
3 | 3.307291 | 3.372669 | 0.413889 | 29.156248 | 00:13 |
4 | 3.221604 | 3.291163 | 0.422917 | 26.874105 | 00:13 |
5 | 3.131132 | 3.213343 | 0.436892 | 24.862070 | 00:13 |
6 | 3.047300 | 3.147552 | 0.447179 | 23.278996 | 00:13 |
7 | 2.978492 | 3.120242 | 0.455946 | 22.651857 | 00:14 |
8 | 2.921093 | 3.109544 | 0.458030 | 22.410818 | 00:14 |
9 | 2.876113 | 3.107162 | 0.458464 | 22.357492 | 00:12 |
learn.save_encoder('finetuned')
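Note that save_encoder persists only the encoder (the embedding and LSTM layers), not the final decoder head. If I came back to this in a fresh session, I would expect to rebuild the learner and reload the weights with something like this (untested sketch):

# Rebuild a compatible learner, then load the fine-tuned encoder
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()
learn.load_encoder('finetuned')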
Experimenting with the Result#
With the fine-tuned model, we can run inference!
We define the beginning of the content we want the model to continue in TEXT, then we call learn.predict, passing the number of words we want the model to generate and a temperature to control the randomness/creativity of the output.
TEXT = "I am interacting with a language model as a thought partner to"
N_WORDS = 256
TEMP = 0.75
pred = learn.predict(TEXT, N_WORDS, temperature=TEMP)
print(pred)
i am interacting with a language model as a thought partner to train .
I think it should be more difficult to use Sonnet but it is often difficult to get to because i am a very familiar language model .
In this way , i wanted to use the phrase Sonnet > rather than just see the phrase as a word .
He 's also using the word " prompt " , which is a word word that is used in art , then as a title to describe a language model that extract structured data from an individual 's experience .
I 've used this idea to describe how a language model has a structure with a single structure and i can try and solve this with the following following Sonnet code : i n't have such a thead for a language model .
Being like this is an easy way to design an [ initial Sonnet ] .
In another example , where you would have written a single word to explicitly describe a language model , i had to use the following word usage : < inline_code >
< / INLINE_CODE > : The word i used to read the script in a language model and then describe the word as an JSON object .
This working was quite different from my JSON code and Python CODE >
the most recent model to use this PYTHON code
The output is a little wild but it kinda sorta makes sense and isn’t too bad for a model I trained in a couple minutes on my MacBook.
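To get a feel for how temperature shapes the output, we can run a quick sweep (temperatures chosen arbitrarily); lower values tend to produce safer, more repetitive text, while higher values drift toward incoherence:

# Compare generations at a few temperatures
for temp in (0.25, 0.75, 1.5):
    print(f"--- temperature={temp} ---")
    print(learn.predict(TEXT, 64, temperature=temp))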
I tweaked several parts of the approach in an effort to improve the model’s output quality.
Jumping back to the clean_content function, I found that removing the markdown frontmatter and replacing triple backticks with single tokens made the output a bit more coherent. When this fine-tuned model tries to generate code, the result makes little sense, and it does strange things like emitting triple-repeated tokens such as importimportimport.
I have a feeling this deficiency may be because the base model (an AWD_LSTM pretrained on WikiText-103, which is mostly English prose) wasn’t trained on much source code.
So there we have it. A simple language model fine-tuned on my blog posts. This was a helpful experience for getting a feel for some feature engineering.
If you liked this post, be sure to check out some of the other notebooks I’ve built while working through the FastAI Course, linked below.