I ran the code from my Fine-tuning “Connections” post using gpt-4o-mini. I was hoping the results might be a bit better, which could motivate an effort to fine-tune the model. I’m not sure where my original version of this code went, so I reconstructed a repo for it. Once I was done, I ran 100 prompts through the model to get a sense of its baseline performance.

Correct: 2.00%
Incorrect: 98.00%
Total Categories Correct: 19.25%

Not great, and not much different from gpt-3.5-turbo. With these kinds of results, I wasn’t particularly motivated to put in the effort to do more fine-tunes.
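
For context, the eval loop amounts to something like the sketch below. The prompt wording, JSON response shape, and exact-match scoring are illustrative assumptions rather than the repo’s literal code; the two metrics mirror the ones above (a puzzle is “Correct” only when all four groups match, while “Total Categories Correct” counts matching groups individually):

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def solve_puzzle(words: list[str]) -> list[set[str]]:
    """Ask gpt-4o-mini to group 16 Connections words into 4 groups of 4.

    The prompt and JSON schema here are assumptions for illustration.
    """
    prompt = (
        "Solve this Connections puzzle. Group the 16 words into 4 groups of 4.\n"
        f"Words: {', '.join(words)}\n"
        'Reply with JSON only: {"groups": [["w1","w2","w3","w4"], ...]}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    data = json.loads(resp.choices[0].message.content)
    return [set(group) for group in data["groups"]]


def score(predicted: list[set[str]], gold: list[set[str]]) -> tuple[bool, int]:
    # A category counts only if a predicted 4-set matches a gold group exactly;
    # the puzzle as a whole counts only when all 4 categories match.
    matched = sum(1 for group in gold if group in predicted)
    return matched == 4, matched


def run_eval(puzzles: list[tuple[list[str], list[set[str]]]]) -> None:
    # puzzles: (words, gold_groups) pairs, loaded however the repo stores them
    solved = categories = 0
    for words, gold in puzzles:
        ok, matched = score(solve_puzzle(words), gold)
        solved += ok
        categories += matched
    n = len(puzzles)
    print(f"Correct: {solved / n:.2%}")
    print(f"Incorrect: {(n - solved) / n:.2%}")
    print(f"Total Categories Correct: {categories / (4 * n):.2%}")
```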


I read through the instructor, marvin, and open-interpreter docs for the first time in a while. It has been interesting to see these libraries grow and diverge. I also read through how Jason has been structuring an evals repo for instructor.