I’m making another, more thorough pass through course.fast.ai, including all the notebooks and videos, and this time I am going to focus more on the projects.
I’ll also be logging a lot more notes, since doing so is by far the most effective way I learn.
The course materials are very detailed, but I’ve still run into some rough edges.
The image search for the bird vs. forest classifier didn’t work out of the box and needed a few modifications (sketch below).
Also, the recommended way to work through the textbook notebooks is Google Colab, which requests a large number of permissions on my Google account while masquerading as “Google Drive for Desktop”, and that doesn’t make me feel great.
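Roughly the kind of modification I mean, as a sketch: calling the duckduckgo_search package directly rather than the notebook’s older search helper. The function name is mine and the details are from memory, so they may not match my notebook exactly.

```python
# Sketch of the modified image search, assuming the duckduckgo_search package.
from duckduckgo_search import DDGS
from fastdownload import download_url

def search_images(term, max_images=30):
    # DDGS().images returns result dicts; the "image" key holds the full-size URL.
    return [r["image"] for r in DDGS().images(term, max_results=max_images)]

urls = search_images("bird photos", max_images=1)
download_url(urls[0], "bird.jpg", show_progress=False)
```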
I was able to run most of the examples on my personal computer, but training the model for the IMDB movie review classifier was quite slow.
I decided it might be worth trying out Colab, since I imagine there will be several more models of this size and complexity that I’ll want to train, and finding a reasonably fast way to do that will be useful.
I went back to the Colab notebook and tried running the cat-or-not classification example.
This seemed to take longer than it did on my local machine, with an apparent ETA of ~30 minutes.
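For reference, the cell I was timing is essentially the book’s cat-or-not example; roughly the following, from memory, so details may differ slightly from the notebook.

```python
# Roughly the cat-or-not training cell from the fastai book (from memory).
# The fine_tune call is the step that showed the ~30 minute ETA on Colab.
from fastai.vision.all import *

path = untar_data(URLs.PETS) / "images"

def is_cat(filename):
    # In the Oxford-IIIT Pet dataset, cat breeds have capitalized filenames.
    return filename[0].isupper()

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
```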
A nice writeup by Eugene on building a simple data viewer webapp with a few different frameworks.
Next time I write FastHTML, I am going to try including llm-ctx.txt as context to see if it helps the language model write it better.
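For context, the kind of code I’d be asking the model to write is small FastHTML apps along these lines (a hello-world sketch from memory, not something model-generated):

```python
# Minimal FastHTML app, written from memory as an illustration.
from fasthtml.common import *

app, rt = fast_app()

@rt("/")
def get():
    # Titled wraps the content in a basic page with a <title> and <h1>.
    return Titled("Data viewer", P("Hello from FastHTML"))

serve()
```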
I was going to write a quick guide on getting up and running with Google’s Gemini models via the API, since I found it quite straightforward, and Twitter is currently dunking on Google over how hard this is.
When I tried to retrace my steps, the CSS for the documentation was failing to load with a 503, so I guess this will have to wait until another day.
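Until then, here is the gist from memory, assuming the google-generativeai package and an API key from AI Studio; the environment variable name is my own choice.

```python
# Minimal Gemini API call, assuming `pip install google-generativeai`
# and an API key stored in GEMINI_API_KEY (variable name is my choice).
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Say hello in five words.")
print(response.text)
```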
I am continuing to see a lot of buzz about ColPali and Qwen2-VL.
I’d like to try these out, but I haven’t put together enough of the pieces to make sense of them yet.
I am also seeing a lot of conversation about how traditional OCR to LLM pipelines will be superseded by these approaches.
Based on my experience with VLMs, this seems directionally correct.
The overall amount of noise makes it tough to figure out what is worth focusing on and what is real vs. hype.
Played around a bit with baml for extracting structured data with a VLM.
It’s an interesting approach and has better ergonomics and tooling than most things I’ve tried so far.
I like how you can declare test cases in the same place as the object schemas and that there is a built-in playground.
I need to see how to handle multi-step pipelines.
I experimented with doing data extraction from pictures of menus.
Early results were mixed.
I think my photo quality isn’t great and that might be one of the bigger issues.
Great to see more concrete results published on which models are “the best” at writing which programming languages.
Iterating on Cogno, improving the “remaining guesses” and sharing functionality.
Language models can’t:
- generate instructions for knitting patterns
- generate crossword puzzles from scratch

Language models can:
- generate Connections puzzles
Incredible read: https://eieio.games/essays/the-secret-in-one-million-checkboxes/
I made many failed attempts at getting Sonnet to write code to display the folder structure from the output of a `tree -F` command using shortcodes.
After a lot of prompting, I wrote a mini-design doc on how the feature needed to be implemented and used it as context for Sonnet.
I tried several variants of the instructions in the design, including asking the model itself to improve them for clarity.
I validated that the model could translate from the `tree -F` output to the HTML markup directly.
It could.
That, in fact, is the example HTML target document in my design doc. Here is that doc:
I tried Townie.
As has become tradition, I tried to build a writing editor for myself.
Townie got a simple version of this working with the ability to send a highlighted selection of text to the backend and run it through a model along with a prompt.
This experience was relatively basic, using a textarea and a popup.
From here, I got Townie to add the ability to show diffs between the model proposal and original text.
It was able to do this for the selected text using CSS in a straightforward manner.
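Not Townie’s code (that app runs as TypeScript on Val Town), but the general shape of the diff step, sketched with the Python standard library: compute word-level opcodes between the selected text and the model’s proposal, then let the frontend wrap each span in a styled element.

```python
# Illustrative only: word-level diff between the original selection and the
# model's proposal, producing spans a UI could style (e.g. strikethrough for
# "del", highlight for "ins").
import difflib

def diff_spans(original: str, proposal: str):
    a, b = original.split(), proposal.split()
    spans = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op in ("equal", "delete", "replace"):
            spans.append(("keep" if op == "equal" else "del", " ".join(a[i1:i2])))
        if op in ("insert", "replace"):
            spans.append(("ins", " ".join(b[j1:j2])))
    return spans

print(diff_spans("the quick brown fox", "the swift brown fox jumps"))
```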
I wanted to support multiple line diffs and diffs across multiple sections of the file.
I suggested we use an open source text editor.
At this point, things started to break.
The app stopped rendering and I wasn’t able to prompt it into resolving the issue.
I did manage to get it to revert (fixing forward) to a state where the app rendered again.
However, the LLM-completion hotkey was broken.
I’ve been trying out Cursor’s hyped Composer mode with Sonnet.
I am a bit disappointed.
Maybe I shouldn’t be.
I think it’s not as good as I expected because I hold Cursor to a higher bar than the other developer tools out there.
It’s possible it’s over-hyped or that I am using it suboptimally.
But it’s more or less the same quality as most other tools at the same level of abstraction, like aider and similar.
I am trying to create a multipane, React-based writing app.
It’s possible I need to provide a more detailed description than I have so far.
However, my main complaint after running it is that I now have a ton of code that isn’t quite right, and I don’t know where or why it’s sort of broken.
Now I need to read all of that code.
This approach is notably less productive for me than slowly building up an app with LLM code generation, where after each generation I can test the new code and make sure it does what I intended (or write automated tests to do that).
The code I get out of Composer doesn’t do what I want, but the LLM doesn’t know why: either my high-level task is under-specified, it doesn’t have enough context, or the ask is too vague.
I don’t usually run into this issue when I use cmd+k.
Maybe I need to watch some videos of folks using it.