Today, I set out to add an llms.txt to this site.
I’ve made a few similar additions in the past with raw post markdown files and a search index.
Every time I try to change something with outputFormats in Hugo, I forget one of the steps, so I'm finally writing this up so I'll have it for next time.
Steps
First, I added a new output format in my config.toml file:
[outputFormats.TXT]
mediaType = "text/plain"
baseName = "llms"
isPlainText = true
Then, I added this format to my home outputs:
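The exact list depends on the site, but with the default HTML and RSS outputs kept, the addition looks something like:
[outputs]
home = ["HTML", "RSS", "TXT"]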
Today, Anthropic entered the LLM code tools party with Claude Code.
Coding with LLMs is one of my favorite activities these days, so I’m excited to give it a shot.
As a CLI tool, it seems most similar to aider and goose, at least among the projects I am familiar with.
Be forewarned: agentic coding tools like Claude Code use a lot of tokens, which are not free.
Monitor your usage carefully as you go, or know that you may spend more than you expect.
An LLM stop sequence is a sequence of tokens that tells the LLM to stop generating text.
I previously wrote about stop sequences and prefilling responses with Claude.
As a reference, here’s how to use a stop sequence with the OpenAI API in Python:
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stop=["Paris"],
)
print(response.choices[0].message.content)
which outputs something like
'The capital of France is '
Notice the LLM never outputs the word “Paris”.
This is due to the stop sequence.
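The same idea works with Anthropic's Messages API via the stop_sequences parameter; a minimal sketch (the model name here is just an example):
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=100,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stop_sequences=["Paris"],
)
# Generation halts before "Paris" is emitted; stop_reason will be "stop_sequence".
print(response.content[0].text)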
I built an Astro component called CodeToggle.astro for my experimental site.
The idea was to create a simple wrapper around a React component (or other interactive component) in an MDX file, so that the source of that rendered component could be nicely displayed as a highlighted code block on the click of a toggle.
Usage looks like this:
import { default as TailwindCalendarV1 } from "./components/TailwindCalendar.v1";
import TailwindCalendarV1Source from "./components/TailwindCalendar.v1?raw";
<CodeToggle source={TailwindCalendarV1Source}>
  <TailwindCalendarV1 client:load />
</CodeToggle>
The implementation of CodeToggle.astro looked like this:
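(What follows is a minimal sketch of the idea rather than the original source, assuming a plain <details>/<summary> toggle in place of the highlighted code block.)
---
// CodeToggle.astro: wrap a rendered component and reveal its source on demand
interface Props {
  source: string;
}
const { source } = Astro.props;
---
<div class="code-toggle">
  <!-- the live, rendered component -->
  <slot />
  <details>
    <summary>View source</summary>
    <pre><code>{source}</code></pre>
  </details>
</div>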
Deepseek is getting a lot of attention with the releases of V3 and recently R1.
Yesterday, they also released a “Pro 7B” version of Janus, a “Unified Multimodal” model that can generate images from text and text from images.
Most models I’ve experimented with can only do one of the two.
The 7B model requires about 15GB of hard disk space.
It also seemed to almost max out the 64GB of memory my machine has.
I’m not deeply familiar with the hardware requirements for this model so your mileage may vary.
The llm package uses a plugin architecture to support numerous different language model API providers and frameworks.
Per the documentation, these plugins are installed using a version of pip, the popular Python package manager:
Use the llm install command (a thin wrapper around pip install) to install plugins in the correct environment:
llm install llm-gpt4all
Because this approach makes use of pip, occasionally we run into familiar issues, like pip being out of date and complaining about it on every use.
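Since llm install is a thin wrapper around pip install, my assumption is that the usual remedy works here too: upgrade pip inside llm's own environment with the same command.
llm install --upgrade pip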
Today, Anthropic released Citations for Claude.
In the release, Anthropic disclosed the following customer case study:
“With Anthropic’s Citations, we reduced source hallucinations and formatting issues from 10% to 0% and saw a 20% increase in references per response. This removed the need for elaborate prompt engineering around references and improved our accuracy when conducting complex, multi-stage financial research,” said Tarun Amasa, CEO, Endex.
I decided to kick the tires on this feature as I thought it could slot in very nicely with a project I am actively working on.
Also, I couldn’t quickly find Python code I could copy and run so I conjured some.
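A rough sketch of a request with a plain-text document and citations enabled (the model name and document contents are placeholders; the exact block shape is worth checking against Anthropic's docs):
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "text",
                        "media_type": "text/plain",
                        "data": "The grass is green. The sky is blue.",
                    },
                    "title": "Example document",
                    "citations": {"enabled": True},
                },
                {"type": "text", "text": "What color is the grass?"},
            ],
        }
    ],
)

# Text blocks in the response may carry citations pointing back into the document.
for block in response.content:
    if block.type == "text":
        print(block.text)
        for citation in getattr(block, "citations", None) or []:
            print(f'  cited: "{citation.cited_text}"')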
Today, I needed to turn SVGs into PNGs.
I decided to use Deno to do it.
Some cursory searching showed Puppeteer should be up to the task.
I also found deno-puppeteer, which seemed like it would provide a reasonable way to make this work.
To start, let’s set up a deno project:
deno init deno-browser-screenshots
cd deno-browser-screenshots
Using puppeteer
Now, add some code to render an SVG with Chrome via puppeteer.
import puppeteer from "https://deno.land/x/puppeteer@16.2.0/mod.ts";
const svgString = `
<svg width="512" height="512" xmlns="http://www.w3.org/2000/svg">
  <rect width="100%" height="100%" fill="#87CEEB"/>
  <circle cx="256" cy="256" r="100" fill="#FFD700"/>
  <path d="M 100 400 Q 256 300 412 400" stroke="#1E90FF" stroke-width="20" fill="none"/>
</svg>`;

if (import.meta.main) {
  try {
    const browser = await puppeteer.launch({
      headless: true,
      args: ["--no-sandbox"],
    });
    const page = await browser.newPage();
    await page.setViewport({ width: 512, height: 512 });
    await page.setContent(svgString);
    await page.screenshot({
      path: "output.png",
      clip: {
        x: 0,
        y: 0,
        width: 512,
        height: 512,
      },
    });
    await browser.close();
  } catch (error) {
    console.error("Error occurred:", error);
    console.error("Make sure Chrome is installed and the path is correct");
    throw error;
  }
}
When we run this code, we get the following error
About 6 months ago, I experimented with running a few different multi-modal (vision) language models on my Macbook.
At the time, the results weren’t so great.
An experiment
With a slight modification to the script from that post, I tested out llama3.2-vision 11B (~8GB in size between the model and the projector).
Using uv and inline script dependencies, the full script looks like this:
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "ollama",
# ]
# ///
import os
import sys

import ollama

PROMPT = "Describe the provided image in a few sentences"


def run_inference(model: str, image_path: str):
    stream = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT, "images": [image_path]}],
        stream=True,
    )
    for chunk in stream:
        print(chunk["message"]["content"], end="", flush=True)


def main():
    if len(sys.argv) != 3:
        print("Usage: python run.py <model_name> <image_path>")
        sys.exit(1)

    model_name = sys.argv[1]
    image_path = sys.argv[2]

    if not os.path.exists(image_path):
        print(f"Error: Image file '{image_path}' does not exist.")
        sys.exit(1)

    run_inference(model_name, image_path)


if __name__ == "__main__":
    main()
We run it with uv, passing a model name and an image path.
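Assuming the script is saved as run.py and using llama3.2-vision as the Ollama model tag (my assumption for the 11B model), the invocation looks something like:
uv run run.py llama3.2-vision path/to/image.jpg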
Deepseek V3 was recently released: a cheap, reliable, supposedly GPT-4 class model.
Quick note upfront, according to the docs, there will be non-trivial price increases in February 2025:
- Input price (cache miss) is going up to $0.27 / 1M tokens from $0.14 / 1M tokens (~2x)
- Output price is going up to $1.10 / 1M tokens from $0.28 / 1M tokens (~4x)
From now until 2025-02-08 16:00 (UTC), all users can enjoy the discounted prices of DeepSeek API