It’s necessary to pay attention to the shape of a language model’s response when incorporating it as a component in a software application. You can’t programmatically tap into the power of a language model if you can’t reliably parse its response. In the past, I have mostly used a combination of prose and examples to define the shape of the language model response in my prompts. Something like:
Respond using JSON with the following shape:
{ ... }
I was curious if I could use something more familiar and structured to archive a similar purpose.
The familiar#
pydantic
is a popular library for data validation in Python and is an integral part of a number of leading libraries including fastapi
.
We can use pydantic
in a toy example to show what it could look like to use it define the desired output structure from the language model.
In the following script, we’ll extract data adhering to the passed pydantic
schema for the class Address
from the following input text.
We’ll do this by passing the code literal of the pydantic
model into the language model prompt, then instruct the model to respond with schema-compliant JSON, that we can unpack into an instance of the schema.
If the response from the language model is not schema-compliant, the code will throw a pydantic.error_wrappers.ValidationError
.
The text#
Last weekend, I visited my friend who lives in a charming little house on Oak Avenue. The address was 3578 Oak Avenue, and it was easy to find with the help of my GPS. While typing in the address, I discovered there was more than one 3578 Oak Ave, but once I entered right zip code, 90011, I found it. The house was surrounded by trees and had a beautiful garden in the front. Inside, my friend had decorated the place with vintage furniture and colorful paintings. We spent the day catching up and enjoying a cup of coffee in the cozy living room. It was a lovely visit and my first time in Los Angeles, and I can’t wait to go back and see my friend’s new garden in full bloom.
The code#
import inspect
import json
import openai
from typing import Type
from pydantic import BaseModel
class Address(BaseModel):
street: str # street name and number
city: str
state: str
zip_code: str
PROMPT: str = """{input_text}
From the above input, extract data and generate a JSON object of type `{type}` that adheres to the following schema.
Schema:
{schema}
Output the JSON and the JSON only.
"""
def build_prompt(text: str, cls: Type[BaseModel]) -> str:
return PROMPT.format(
input_text=text,
type=cls.__name__,
schema=inspect.getsource(cls),
)
def extract_data(text: str, cls: Type[BaseModel]) -> BaseModel:
prompt: str = build_prompt(text, cls)
completion = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": prompt}
],
temperature=0,
)
result = completion.choices[0].message.content
result_obj: dict = json.loads(result)
return cls.parse_obj(result_obj)
def main():
text = "<input text from above>"
return extract_data(text, Address)
if __name__ == "__main__":
print(main())
The full prompt after formatting looks as follows:
Last weekend, I visited my friend who lives in a charming little house on Oak Avenue. The address was 3578 Oak Avenue, and it was easy to find with the help of my GPS. While typing in the address, I discovered there was more than one 3578 Oak Ave, but once I entered right zip code, 90011, I found it. The house was surrounded by trees and had a beautiful garden in the front. Inside, my friend had decorated the place with vintage furniture and colorful paintings. We spent the day catching up and enjoying a cup of coffee in the cozy living room. It was a lovely visit and my first time in Los Angeles, and I can’t wait to go back and see my friend’s new garden in full bloom.
From the above input, extract data as a JSON object of type
Address
that adheres to the following schema. Infer and fill in missing data values if possible Schema:class Address(BaseModel): street: str # street name and number city: str state: constr(min_length=2, max_length=2) zip_code: str
JSON output:
Running the script yields the following:
โฏ python script.py
street='3578 Oak Avenue' city='Los Angeles' state='CA' zip_code='90011'
First thing, it works and pretty well! Second, there are a few subtleties that help make this approach work as well as it does.
- From the prompt, “Infer and fill in missing data values if possible”: The state is not specified in the input text, but the language model infers it based on the instructions
state: constr(min_length=2, max_length=2)
: It would be hard to fault the model if it inferred the state as “California”, but we use an additional schema constraint to ensure the two character abbreviation is usedstreet: str # street name and number
: After testing the script a few times, I ran into an issue where the resultingstreet
was parsed as “Oak Avenue”. Adding this comment helps ensure that the number and street name are extracted together.
Generalizing#
This approach is already relatively general.
One can easily swap the BaseModel
subclass passed into extract_data
for a different data structure.
There’s a bit more work to do here to extract data adhering to the schema of a model with field values that are other models.
For example:
class Address(BaseModel):
street: str
city: str
state: str
zip_code: str
class User(BaseModel):
first_name: str
last_name: str
address: Address
In order to parse out data for a User
for an input text, you’d want to include the schemas for both the Address
and the User
.
JSON Schema#
I’ve written about experimenting with JSON Schema to construct schema-adherent responses from a language model. You can also obtain and use JSON schema from pydantic
classes to use for parsing and schema-adherent response construction:
Address.schema_json()