Some unstructured thoughts on the types of tasks language models seem to be good (and bad) at completing:

A language model is an effective tool for solving problems when you can describe the answer or output you want from it with language. It is a good candidate to replace manual processes performed by humans, where judgement or the application of semantic rules is needed to get the right answer. Existing machine learning approaches are already good at classifying or predicting over a large number of features, specifically when one doesn’t know how things can or should be clustered or labelled just by looking at the data points. To give an example where a language model will likely not perform well: imagine you want to generate a prediction for the value of a house and the land it sits on, given a list of data points describing it:

Experimenting with using a language model to improve the input prompt, then using that output as the actual prompt for the model and returning the result. It’s a bit of a play on the “critique” approach. Some of the outputs were interesting, but I need a better way to evaluate the results.

import sys
import openai  # uses the pre-1.0 openai SDK interface (openai.ChatCompletion)

MODEL = "gpt-3.5-turbo-16k"

IMPROVER_PROMPT = """
You are an expert prompt writer for a language model. Please convert the user's message into an effective prompt that will be sent to a language model to produce a helpful and useful response.

Output the improved prompt only.
"""

def generate_improved_prompt(prompt: str) -> str:
    completion = openai.ChatCompletion.create(
        model=MODEL,
        temperature=1.0,
        messages=[
            {
                "role": "system",
                "content": IMPROVER_PROMPT,
            },
            {
                "role": "user",
                "content": prompt,
            },
        ],
    )
    return completion.choices[0].message.content

def generate_completion(prompt: str) -> str:
    completion = openai.ChatCompletion.create(
        model=MODEL,
        temperature=1.0,
        messages=[
            {
                "role": "user",
                "content": prompt,
            },
        ],
    )
    return completion.choices[0].message.content

def main():
    prompt = ' '.join(sys.argv[1:])
    standard_result = generate_completion(prompt)
    print("Standard completion:")
    print(standard_result)
    improved_prompt = generate_improved_prompt(prompt)
    print("\nImproved prompt:")
    print(improved_prompt)
    improved_result = generate_completion(improved_prompt)
    print("\nImproved completion:")
    print(improved_result)
    return improved_result

if __name__ == "__main__":
    main()
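
One possible starting point for the evaluation problem is a blind pairwise comparison: show the standard and improved completions in random order, record which one is preferred, and tally the wins. This is a minimal sketch under that assumption, not an established method. `blind_compare`, `Judge`, and `human_judge` are hypothetical names, and the judge can be a person at the keyboard or any function that returns "A" or "B".

```python
import random
from collections import Counter
from typing import Callable, Iterable, Tuple

# Hypothetical type: a judge receives (prompt, option_a, option_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]

def blind_compare(
    rows: Iterable[Tuple[str, str, str]],
    judge: Judge,
    seed: int = 0,
) -> Counter:
    """Tally wins for 'standard' vs. 'improved' completions.

    Each row is (prompt, standard_completion, improved_completion). The two
    completions are presented in random order so the judge cannot tell
    which one came from the improved prompt.
    """
    rng = random.Random(seed)
    wins: Counter = Counter()
    for prompt, standard, improved in rows:
        options = [("standard", standard), ("improved", improved)]
        rng.shuffle(options)
        choice = judge(prompt, options[0][1], options[1][1])
        if choice not in ("A", "B"):
            raise ValueError(f"judge must return 'A' or 'B', got {choice!r}")
        winner = options[0][0] if choice == "A" else options[1][0]
        wins[winner] += 1
    return wins

def human_judge(prompt: str, option_a: str, option_b: str) -> str:
    """Ask a person at the terminal to pick the better completion."""
    print(f"Prompt: {prompt}\n\nA:\n{option_a}\n\nB:\n{option_b}\n")
    return input("Which is better? [A/B] ").strip().upper()
```

Feeding it the `(prompt, standard_result, improved_result)` triples from a handful of runs and passing `human_judge` would give a rough win count for each side.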

2023-07-04

I’ve been working through a series on Nix flakes. It’s well written and shows some interesting applications of the tool set. I’m still trying to wrap my head around exactly where Nix could fit into my development lifecycle. It seems to wrap up builds and package management into one, sort of like Docker, Bazel, and pip/npm/brew all in one. The tutorial has shown some useful variations and has convinced me flakes are the way to go, but I need to spend more time understanding the primitives as well. I understand little of what’s going on in the flake.nix files I’ve looked at.

Facebook (Meta, whatever) announced today that Threads will launch on July 6th. Given how much worse it feels like Twitter has become (my experience only), on one hand, I could see people migrating there because no great alternative has really emerged. On the other, Facebook has zero “public” products where the user experience is even palatable for me, personally (I use WhatsApp but it’s basically iMessage). Instagram and Facebook both rapidly became completely intolerable for me due to their content. Maybe that is a matter of curation, but I bet, at least in some part, it’s a result of how Facebook runs their business and why Twitter never made much ad revenue compared to them (and why Reddit struggles to as well). If I had to make a bet, I would bet on people migrating to Threads. Personally, I won’t until they have a webapp.

A simple shell function to set up a Python project scaffold. It’s idempotent, so it won’t overwrite an existing folder or env.

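pproj () {
    # Create the project folder if needed (no-op when it already exists)
    mkdir -p "$1" || return
    cd "$1" || return
    # Create the virtual env (an existing env is reused, not cleared), then activate it
    python -m venv env
    . env/bin/activate
}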
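
For comparison, roughly the same scaffold can be sketched with Python’s standard-library `venv` module. This is an illustration, not a replacement for the shell function: a Python script can’t change the calling shell’s directory or activate the env, so it only creates the folder and the env. `scaffold_project` is a hypothetical name.

```python
import venv
from pathlib import Path

def scaffold_project(name: str, with_pip: bool = True) -> Path:
    """Create <name>/ and <name>/env, leaving any existing contents in place."""
    project = Path(name)
    project.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    # clear=False (the default) keeps an existing env directory instead of wiping it.
    venv.create(project / "env", with_pip=with_pip, clear=False)
    return project
```

Calling `scaffold_project("myproj")` twice should leave files already inside `myproj/env` untouched, which is the same property the shell version relies on.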

I’ve been following Jason’s work experimenting with different abstractions for constructing prompts and structuring responses. I’ve long felt that building prompts with strings is not the type of developer experience that will win the day. On the other hand, I’m wary of the wrong abstraction that would move the developer too far away from the actual prompt, which would make it harder to construct good prompts and steer the model. I’m not sure if this is an ORM vs. SQL conversation or if there’s an abstraction that exists as a happy medium.
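
One way to picture a possible middle ground is a small typed object that holds the pieces of a prompt but always renders to a plain string you can read and tweak. This is a minimal sketch of that idea, not how any particular library works; `Prompt` and `render` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Prompt:
    """Hypothetical structured prompt: typed fields in, plain string out."""
    task: str
    constraints: List[str] = field(default_factory=list)
    output_format: str = ""

    def render(self) -> str:
        """Return the exact text that would be sent to the model."""
        parts = [self.task.strip()]
        if self.constraints:
            bullets = "\n".join(f"- {c}" for c in self.constraints)
            parts.append(f"Constraints:\n{bullets}")
        if self.output_format:
            parts.append(f"Respond with: {self.output_format}")
        return "\n\n".join(parts)

prompt = Prompt(
    task="Summarize the following changelog for end users.",
    constraints=["Use plain language", "Keep it under 100 words"],
    output_format="a single paragraph",
)
print(prompt.render())
```

The point of `render()` is that the final prompt text stays one call away, so the structure helps with assembly without hiding what the model actually sees.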

Did some work with Clojure destructuring.

Unpack values into specific variables.

user=> (let [[a b c] [1 2 3]] (println a b c))
1 2 3
nil

Unpack the first N items, ignoring the rest.

user=> (let [[a b] [1 2 3]] (println a b))
1 2
nil

Unpack the first N items to variables and capture the rest as a sequence.

user=> (let [[a b & rst] [1 2 3 4 5]] (println a b rst))
1 2 (3 4 5)
nil
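
For comparison, here are the same three cases using Python’s iterable unpacking (a translation for illustration, not part of the Clojure session). One difference: Python raises a `ValueError` when the counts don’t match, so ignoring the rest needs an explicit starred target, and the captured rest is a list.

```python
# Unpack values into specific variables.
a, b, c = [1, 2, 3]
print(a, b, c)  # 1 2 3

# Unpack the first two items, ignoring the rest (the starred `_` absorbs the remainder).
a, b, *_ = [1, 2, 3]
print(a, b)  # 1 2

# Unpack the first two items and capture the rest as a list.
a, b, *rest = [1, 2, 3, 4, 5]
print(a, b, rest)  # 1 2 [3, 4, 5]
```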

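Doing math with a double and a BigDecimal casts the result down to a double, and wrapping that result in bigdec afterward doesn’t recover the lost precision. Using BigDecimal literals for both operands, as in (* 0.1M 101M), keeps the result exact (10.1M).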

user=> (* 0.1 101M)
10.100000000000001
user=> (bigdec (* 0.1 101M))
10.100000000000001M
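
Python’s `decimal` module shows the same underlying issue from a different angle (an analogy, not a description of Clojure’s internals). Python refuses to mix `float` and `Decimal` in arithmetic and raises a `TypeError`, so the precision loss only appears if you convert an already-rounded float. Building both operands as exact decimals avoids it.

```python
from decimal import Decimal

# Float arithmetic: 0.1 has no exact binary representation, so the error shows up.
print(0.1 * 101)  # 10.100000000000001

# Converting after the fact keeps the error, like (bigdec (* 0.1 101M)).
print(Decimal(str(0.1 * 101)))  # 10.100000000000001

# Keeping both operands as decimals from the start stays exact.
print(Decimal("0.1") * Decimal("101"))  # 10.1

# Python will not silently mix the two types.
try:
    0.1 * Decimal("101")
except TypeError as exc:
    print("TypeError:", exc)
```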

Heard the phrase “if someone wins the lottery” used today to describe a teammate leaving a team. I much prefer this to the more morbid alternatives.


I tried gpt-engineer today. I liked the approach and the setup instructions are good. I think I remember needing to use Python 3.11 instead of the 3.8 I was running, but beyond that the readme instructions were on point.

Process

You start by creating a project folder with a plaintext prompt. You start the script and point it at your project folder. The program reads your prompt, then uses the LM to ask clarifying questions. The clarifying questions seem pretty effective. If you answer more than one of the predetermined questions at once, the program seems to recognize that and removes them from the list. Finally, it creates an actual project, with source code, pretty consistently (3/3 times I tried). I used it to try to create a 1-player Scattergories CLI game, or something close.

I’ve been thinking about the concept of “prompt overfitting”. In this context, there is a distinction between model overfitting and prompt overfitting. Say you want to use a large language model as a classifier. You may give it several example inputs and the expected outputs. I don’t have hard data to go by, but it feels meaningful to keep the prompt generic or abstract where possible rather than enumerating overly specific cases in a way that obfuscates the broader pattern you’re hoping to apply. I hypothesize these overly specific examples could interfere with the model output in unintended, overly restrictive ways.
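
To make the distinction concrete, here is a hypothetical pair of classifier prompts for the same task. The task, labels, and examples are invented for illustration, and nothing here measures which prompt performs better; it only shows what “generic” versus “overly specific” might look like. `build_classifier_prompt` is a made-up helper.

```python
from typing import List, Tuple

def build_classifier_prompt(rule: str, examples: List[Tuple[str, str]], text: str) -> str:
    """Assemble a few-shot classification prompt from a rule and labeled examples."""
    shots = "\n".join(f"Message: {msg}\nLabel: {label}" for msg, label in examples)
    return f"{rule}\n\n{shots}\n\nMessage: {text}\nLabel:"

# Generic: states the pattern and gives a couple of varied examples.
generic = build_classifier_prompt(
    rule="Label each message as 'urgent' if it needs action today, otherwise 'normal'.",
    examples=[
        ("The production database is down.", "urgent"),
        ("Here are the notes from last week's meeting.", "normal"),
    ],
    text="The client demo is in an hour and the login page is broken.",
)

# Overly specific: the examples enumerate narrow cases and never state the pattern,
# so the model may latch onto surface details (every example mentions a database).
specific = build_classifier_prompt(
    rule="Label each message as 'urgent' or 'normal'.",
    examples=[
        ("The production database is down.", "urgent"),
        ("The staging database is down.", "urgent"),
        ("The production database is slow.", "urgent"),
    ],
    text="The client demo is in an hour and the login page is broken.",
)

print(generic)
print("---")
print(specific)
```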