2024-09-19

[logs] September 19, 2024

I finally found some time to run a more comprehensive eval of Connections, with one guess at a time and using Python code to validate the guesses and give feedback. I ran about 100 puzzles with gpt-4o-mini, gpt-4o, and claude-3-5-sonnet, but it became clear that Sonnet was going to perform the best, so I decided to complete the 466 puzzles released as of today with Sonnet only. This wasn’t cheap, but it was interesting to see the results. I’m going to write up some more comprehensive findings and push the code soon.
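As a rough illustration of the "one guess at a time, validated in Python" idea, here is a minimal sketch. It is not the eval code itself: the `ConnectionsGame` class, its feedback strings, the "one away" hint, and the four-mistake limit are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionsGame:
    """Hypothetical game state: category name -> its four UPPERCASE words."""

    solution: dict[str, frozenset[str]]
    max_mistakes: int = 4  # assumed limit, mirroring the usual game rules
    mistakes: int = 0
    solved: list[str] = field(default_factory=list)

    def guess(self, words: list[str]) -> str:
        """Validate a single guess and return feedback text for the model."""
        picked = frozenset(w.strip().upper() for w in words)
        if len(picked) != 4:
            return "invalid: a guess must contain exactly four distinct words"

        # Only unsolved groups are still on the board.
        remaining = {
            name: group
            for name, group in self.solution.items()
            if name not in self.solved
        }
        board = set().union(*remaining.values())
        if not picked <= board:
            return "invalid: guess uses words that are not on the board"

        for name, group in remaining.items():
            if picked == group:
                self.solved.append(name)
                return f"correct: {name}"

        # A wrong but well-formed guess costs one mistake.
        self.mistakes += 1
        best_overlap = max(len(picked & group) for group in remaining.values())
        if best_overlap == 3:
            return "incorrect: one away"
        return "incorrect"

    @property
    def finished(self) -> bool:
        """True once every group is solved or the mistake budget is spent."""
        all_solved = len(self.solved) == len(self.solution)
        return all_solved or self.mistakes >= self.max_mistakes
```

In a loop like this, the string returned by `guess` would be appended to the conversation before asking the model for its next guess, until `finished` is true.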

2024-09-18

[logs] September 18, 2024

Some interesting commentary on the behaviors of founders, managers and leaders written by Rands.

Race’s article on using Jupyter notebooks with Hugo was a helpful intro to the landscape.

2024-09-13

[logs] September 13, 2024

There have been a number of small-in-scope but tough problems that I’ve run into that models haven’t been able to crack as I’ve presented them via prompting. Usually, these are problems with a few separate areas of complexity, like a recursive parser plus a weird templating language to write it in. o1 is the first model I can recall that took my high-level approach and suggested a simplifying change to the input (tree -F to tree -J -F) that meaningfully reduced the problem’s complexity (the parser is no longer needed if the input is JSON). With this change and two follow-ups to correct a hallucination, the model output a recursive Hugo template shortcode to render a file tree with collapsible folders.
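To show why JSON input removes the need for a parser, here is a small Python sketch of the same recursion (the real solution was a Hugo shortcode, which is not reproduced here). It assumes the JSON shape that tree -J emits: a top-level list of nodes with "type", "name", and, for directories, "contents", followed by a "report" entry. The function names and the details/summary markup are my own choices for illustration.

```python
import json
from html import escape


def render_node(node: dict) -> str:
    """Render one node; directories recurse into a collapsible <details>."""
    name = escape(node.get("name", ""))
    if node.get("type") == "directory":
        children = "".join(
            render_node(child) for child in node.get("contents", [])
        )
        return (
            f"<li><details><summary>{name}</summary>"
            f"<ul>{children}</ul></details></li>"
        )
    # Files (and any other node types) are rendered as plain leaves.
    return f"<li>{name}</li>"


def render_tree(tree_json: str) -> str:
    """Turn JSON shaped like `tree -J` output into nested HTML lists."""
    entries = json.loads(tree_json)
    # Skip the trailing "report" entry, which holds counts rather than files.
    items = "".join(
        render_node(entry) for entry in entries if entry.get("type") != "report"
    )
    return f"<ul>{items}</ul>"
```

The whole job reduces to one recursive function over dictionaries, which is the same shape a recursive template takes once the input is already structured.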

2024-09-11

[logs] September 11, 2024

fastai_course

I’m making another, more thorough pass of course.fast.ai, including all notebooks and videos and this time I am going to focus more on the projects. I’ll also be logging a lot more notes as doing so is by far the most effective way that I learn things.

The course materials are very detailed, but I’ve still run into some rough edges. The image search for the bird vs. forest image classifier didn’t work without some modifications. Also, the recommended approach for working through the textbook notebooks is Google Colab, which requests a large number of permissions on my Google account while presenting itself as “Google Drive for Desktop”, and that doesn’t make me feel great. I was able to run most of the examples on my personal computer, but training the model for the IMDB movie review classifier was quite slow. I decided it might be worth trying out Colab, since I imagine there could be several more models of this size/complexity I’ll want to train, and finding a reasonably fast way to do that will be useful. I went back to the Colab notebook and tried running the cat-or-not classification example. This seemed to take longer than it did on my local machine, with an apparent ETA of ~30 minutes.

2024-09-10

[logs] September 10, 2024

A nice writeup by Eugene on building a simple data viewer webapp with a few different frameworks. I am going to need to try out including llm-ctx.txt next time I write FastHTML to see if it helps make the language model better at writing it.

2024-09-08

[logs] September 8, 2024

I was going to write a quick guide on how to get up and running using Google’s Gemini model via API, since I found it quite straightforward and Twitter is currently dunking on Google for how hard this is. When I tried to retrace my steps, the CSS for the documentation was failing to load with a 503, so I guess this will have to wait until another day.

2024-09-07

[logs] September 7, 2024

colpali

I am continuing to see a lot of buzz about ColPali and Qwen2-VL. I’d like to try these out but haven’t put together enough of the pieces to make sense of it yet. I am also seeing a lot of conversation about how traditional OCR to LLM pipelines will be superseded by these approaches. Based on my experience with VLMs, this seems directionally correct. The overall amount of noise makes it tough to figure out what is worth focusing on and what is real vs. hype.

2024-09-05

[logs] September 5, 2024

baml

Played around a bit with baml for extracting structured data with a VLM. It’s an interesting approach and has better ergonomics and tooling than most things I’ve tried so far. I like how you can declare test cases in the same place as the object schemas and that there is a built-in playground. I need to see how to handle multi-step pipelines.

I experimented with doing data extraction from pictures of menus. Early results were mixed. I think my photo quality isn’t great and that might be one of the bigger issues.

2024-09-02

[logs] September 2, 2024

Benchmarking >80 LLMs shows: The best model is not necessarily the best for your programming language 😱

- Best overall: Anthropic’s Sonnet 3.5
- Best for Go: Meta’s Llama 3.1 405B
- Best for Java: OpenAI’s GPT-4 Turbo
- Best for Ruby: OpenAI’s GPT-4o

Good models for one… pic.twitter.com/EYUphEI5rH
— Markus Zimmermann (@zimmskal) September 2, 2024

Great to see more concrete results published on how different models are “the best” at writing different programming languages.

Iterating on Cogno, improving the “remaining guesses” and sharing functionality.

2024-08-31

[logs] August 31, 2024

Language models can’t

generate instructions for knitting patterns
generate crossword puzzles from scratch

Language models can

generate Connections puzzles