About 6 months ago, I experimented with running a few different multi-modal (vision) language models on my MacBook.
At the time, the results weren’t so great.
An experiment
With a slight modification to the script from that post, I tested out llama3.2-vision 11B (~8GB in size between the model and the projector).
Using uv and inline script dependencies, the full script looks like this:
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "ollama",
# ]
# ///
import os
import sys

import ollama

PROMPT = "Describe the provided image in a few sentences"


def run_inference(model: str, image_path: str):
    stream = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT, "images": [image_path]}],
        stream=True,
    )
    for chunk in stream:
        print(chunk["message"]["content"], end="", flush=True)


def main():
    if len(sys.argv) != 3:
        print("Usage: python run.py <model_name> <image_path>")
        sys.exit(1)
    model_name = sys.argv[1]
    image_path = sys.argv[2]
    if not os.path.exists(image_path):
        print(f"Error: Image file '{image_path}' does not exist.")
        sys.exit(1)
    run_inference(model_name, image_path)


if __name__ == "__main__":
    main()
We run it with
DeepSeek V3 was recently released: a cheap, reliable, supposedly GPT-4 class model.
A quick note upfront: according to the docs, there will be non-trivial price increases in February 2025:
- Input price (cache miss) is going up to $0.27 / 1M tokens from $0.14 / 1M tokens (~2x)
- Output price is going up to $1.10 / 1M tokens from $0.28 / 1M tokens (~4x)
From now until 2025-02-08 16:00 (UTC), all users can enjoy the discounted prices of DeepSeek API
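As a quick sanity check on the rounded multipliers in the list above, the ratios can be computed directly from the listed prices. This is only a small sketch: the `PRICES` layout and the `increase_factor` helper are my own naming for illustration, not anything from the DeepSeek docs.

```python
# Prices in USD per 1M tokens, copied from the list above.
PRICES = {
    "input (cache miss)": {"discounted": 0.14, "new": 0.27},
    "output": {"discounted": 0.28, "new": 1.10},
}


def increase_factor(discounted: float, new: float) -> float:
    """Return how many times more expensive the new price is."""
    return new / discounted


for name, price in PRICES.items():
    factor = increase_factor(price["discounted"], price["new"])
    print(f"{name}: {factor:.2f}x")
```

This works out to roughly 1.93x for input and 3.93x for output, which matches the ~2x and ~4x shorthand.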
This year included a lot of writing and learning new things.
My goals for the year were the following:
Train a machine learning model and write about it
- I’ve been learning ML in reverse, first playing with language models and now learning more about what it actually takes to construct a system capable of ML inference. Training my own models feels like the next step to develop depth of understanding in this area.
Build search for my blog
I’ve been building an Electron app called “Delta”.
Delta is a tool for knowledge exploration and ideation through the branching of conversations with language models.
I have lots of ideas for how I want to make this idea useful and valuable, but today it looks like this.
This article is about my struggles building Delta using Electron and how I eventually found workable, though likely suboptimal, solutions to these challenges.
I’m aiming to set up a space for more interactive UX experiments.
My current Hugo blog has held up well with my scale of content but doesn’t play nicely with modern JavaScript frameworks, where most of the open source energy is currently invested.
Astro seemed like a promising option because it supports Markdown content along with a plug-and-play approach to many different frameworks like React, Svelte and Vue.
More importantly, there is a precedent for flexibility when the Next Big Thing emerges, which makes Astro a plausible test bed for new concepts without requiring a brand new site or a rewrite.
At least, this was my thought process when I decided to try it out.
In this notebook, we’ll use the MovieLens 10M dataset and collaborative filtering to create a movie recommendation model.
We’ll use the data from movies.dat and ratings.dat to create embeddings that will help us predict ratings for movies I haven’t watched yet.
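To make the idea concrete, here is a minimal sketch of dot-product collaborative filtering in plain Python. It is not the notebook's actual training code: the toy ratings, the hyperparameters, and every function name below are assumptions chosen for illustration. Each user and movie gets a small embedding vector, and a predicted rating is the global mean plus the dot product of the two vectors, fitted with stochastic gradient descent.

```python
import random

# Toy (user_id, movie_id, rating) triples; real data would come from ratings.dat.
RATINGS = [
    (0, 0, 5.0), (0, 1, 4.0), (0, 2, 1.0),
    (1, 0, 4.0), (1, 1, 5.0), (1, 3, 1.0),
    (2, 2, 5.0), (2, 3, 4.0), (2, 0, 1.0),
]
N_USERS, N_MOVIES, N_FACTORS = 3, 4, 3


def init_embeddings(count, dim, rng):
    """Create one small random vector per user or movie."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(count)]


def predict(user_vec, movie_vec, mean):
    """Predicted rating: global mean plus the dot product of the embeddings."""
    return mean + sum(u * m for u, m in zip(user_vec, movie_vec))


def mse(ratings, users, movies, mean):
    """Mean squared error over the observed ratings."""
    errors = [(r - predict(users[u], movies[m], mean)) ** 2 for u, m, r in ratings]
    return sum(errors) / len(errors)


def train(ratings, epochs=500, lr=0.05, reg=0.01, seed=0):
    """Fit user and movie embeddings with SGD and L2 regularization."""
    rng = random.Random(seed)
    users = init_embeddings(N_USERS, N_FACTORS, rng)
    movies = init_embeddings(N_MOVIES, N_FACTORS, rng)
    mean = sum(r for _, _, r in ratings) / len(ratings)
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - predict(users[u], movies[m], mean)
            for k in range(N_FACTORS):
                u_k, m_k = users[u][k], movies[m][k]
                users[u][k] += lr * (err * m_k - reg * u_k)
                movies[m][k] += lr * (err * u_k - reg * m_k)
    return users, movies, mean


users, movies, mean = train(RATINGS)
# User 0 never rated movie 3, so this is a prediction for an unseen pair.
print(f"user 0, movie 3: {predict(users[0], movies[3], mean):.2f}")
```

On the real dataset, a library with batched tensor operations would be the practical choice; the pure-Python loop here is only meant to show what the embeddings are doing.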
Create some personal data
Before I wrote any code to train models, I code-generated a quick UI for rating movies, which generates my_ratings.dat to append to ratings.dat.
There is a bit of code needed to do that.
The nice part is that, using inline script metadata and uv, we can write (generate) and run the whole tool in a single file.
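The append step on its own can be sketched as a single-file script like the one below. This is not the generated UI: `MY_USER_ID`, `format_rating`, and `append_ratings` are hypothetical names, and the sketch assumes my_ratings.dat uses the same `UserID::MovieID::Rating::Timestamp` layout as the MovieLens ratings.dat.

```python
# /// script
# requires-python = ">=3.12"
# dependencies = []
# ///
import tempfile
from pathlib import Path

# Hypothetical user id for the personal ratings; it only needs to be an id
# that does not already appear in ratings.dat.
MY_USER_ID = 999_999


def format_rating(user_id: int, movie_id: int, rating: float, timestamp: int) -> str:
    """Format one rating as `UserID::MovieID::Rating::Timestamp`."""
    return f"{user_id}::{movie_id}::{rating:g}::{timestamp}"


def append_ratings(source: Path, target: Path) -> int:
    """Append every non-empty line of `source` to `target`.

    Assumes `target` already ends with a newline. Returns the number of
    lines appended.
    """
    lines = [line for line in source.read_text().splitlines() if line.strip()]
    with target.open("a") as f:
        for line in lines:
            f.write(line + "\n")
    return len(lines)


if __name__ == "__main__":
    # Demo on throwaway files so the real dataset is never touched.
    with tempfile.TemporaryDirectory() as tmp:
        ratings = Path(tmp) / "ratings.dat"
        mine = Path(tmp) / "my_ratings.dat"
        ratings.write_text("1::122::5::838985046\n")
        mine.write_text(format_rating(MY_USER_ID, 122, 4.5, 1700000000) + "\n")
        print(append_ratings(mine, ratings), "rating(s) appended")
```

Saved as a single file, a script in this shape can be run with `uv run`, which reads the inline metadata block at the top.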
I’ve started posting more on Bluesky and I noticed that articles from my site didn’t have social image previews 😔
I looked into Poison’s code (the theme this site is based on) and found that it supports social image previews at the site level or in the site’s assets folder.
This approach didn’t quite work for me.
I recently switched to using page bundles, which group Markdown and content in the same folder and make linking to images from Markdown straightforward.
With a few modifications, I was able to make the code work to use images in the page bundles for social previews as well.
I explored how embeddings cluster by visualizing LLM-generated words across different categories.
The visualizations helped build intuition about how these embeddings relate to each other in vector space. Most of the code was generated using Sonnet.
!pip install --upgrade pip
!pip install openai
!pip install matplotlib
!pip install scikit-learn
!pip install pandas
!pip install plotly
!pip install "nbformat>=4.2.0"
We start by setting up functions to call ollama locally to generate embeddings and words for several categories.
The generate_words function occasionally doesn’t adhere to instructions, but the end results are largely unaffected.
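One common way to quantify how embeddings relate to each other in vector space is cosine similarity. The sketch below uses made-up three-dimensional vectors rather than real model output, and `cosine_similarity` and `nearest` are illustrative helpers, not functions from the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings (assumption for illustration).
TOY_EMBEDDINGS = {
    "apple": [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.1],
    "hammer": [0.0, 0.2, 0.9],
}


def nearest(word, embeddings):
    """Return the other word whose vector is most similar to `word`."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))


print(nearest("apple", TOY_EMBEDDINGS))
```

Words from the same category should score closer to 1 with each other than with words from other categories, which is the same structure the cluster plots show visually.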
Language models are more than chatbots - they’re tools for thought.
The real value lies in using them as intellectual sounding boards to brainstorm, refine and challenge our ideas.
What if you could explore every tangent in a conversation without losing the thread?
What if you could rewind discussions to explore different paths?
Language models make this possible.
This approach unlocks your next level of creativity and productivity.
Context Quality Counts
I’ve found language models useful for iterating on ideas and articulating thoughts. Here’s an example conversation (feel free to skip; this conversation is used in the examples later on):
Using Cursor, we can easily get a first pass at creating alt text for an image using a language model.
It’s quite straightforward using a multi-modal model/prompt.
For this example, we’ll use claude-3-5-sonnet-20241022.
Here’s what it generates.