This past week, OpenAI added function calling to their chat completions API. This addition is exciting because it makes schemas a first-class citizen in calls to OpenAI chat models. As the example code and naming suggest, you define a list of functions, each with a schema for the parameters required to call it; the model determines whether a function needs to be invoked in the context of the completion and, if so, returns JSON adhering to that function's schema. If you've read anything else I've written, you probably know what I'm going to try next: let's use a function to extract structured data from an unstructured input.
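For illustration, a function definition is a plain dict: a name, a description, and a JSON schema describing the parameters. Here is a minimal sketch modeled on the get_current_weather example from OpenAI's announcement (the names here are placeholders, not something we'll use below):

functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
]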

Extract a recipe as structured data#

I found this recipe and I want to try it out. I want to parse the content on the page and extract the recipe in a form that I could easily render on a personal recipe site. A quick check of the page shows that most of the content is nested within an HTML element with the class "content". Here is some Python code to extract all the text from the HTML, stripping out the markup:

import requests
from bs4 import BeautifulSoup

url = "https://christieathome.com/blog/kl-hokkien-mee/"
response = requests.get(url)
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')
# Find the first element with the class "content"
content_element = soup.find(class_="content")
content = content_element.get_text(strip=True, separator=" ")
print(content)

This code outputs a big block of text, a lot of which isn’t ingredients or instructions for the recipe.

Home » Recipes » Mains KL Hokkien Mee Last Modified: June 28, 2022 - Published by: christieathome ...

Before we get into calling the language model, let's write a schema for the data we'd like to extract from the page's content. We'll use pydantic because its models can easily be converted to JSON schemas.

from pydantic import BaseModel
from typing import List

class Ingredient(BaseModel):
    name: str
    quantity: float
    unit: str

class Recipe(BaseModel):
    title: str
    description: str
    ingredients: List[Ingredient]
    steps: List[str]
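In pydantic v1, converting a model to a JSON schema is a single method call, so we can sanity-check exactly what the model will see:

import json

print(json.dumps(Recipe.schema(), indent=2))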

Nothing too surprising so far. Now for the interesting part: let's wire up a call to OpenAI that uses functions and our Recipe schema to structure the response:

import json
import openai

messages = [{"role": "user", "content": content}]
functions = [
    {
        "name": "print_json_data",
        "description": "Print JSON data extracted from the input",
        # Convert the pydantic model into a JSON schema for the function's parameters
        "parameters": Recipe.schema(),
    },
]
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=messages,
    functions=functions,
    function_call="auto",
)
response_message = response["choices"][0]["message"]
# The structured output arrives as a JSON string in the function call's "arguments"
arguments = response_message["function_call"]["arguments"]
print(json.loads(arguments))
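One detail worth noting: with function_call="auto", the model decides whether to call the function at all, so it can reply with plain text instead. Since we always want structured output here, we could force the call by naming the function, which the API supports:

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=messages,
    functions=functions,
    # Require the model to call our function rather than letting it decide
    function_call={"name": "print_json_data"},
)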

Here is the full script after a bit of refactoring:

import openai
import json
import requests

from bs4 import BeautifulSoup
from pydantic import BaseModel
from typing import List

class Ingredient(BaseModel):
    name: str
    quantity: float
    unit: str

class Recipe(BaseModel):
    title: str
    description: str
    ingredients: List[Ingredient]
    steps: List[str]

def run_conversation(content):
    messages = [{"role": "user", "content": content}]
    functions = [
        {
            "name": "print_minified_json_data",
            "description": "Print minified JSON data extracted from the input",
            "parameters":  Recipe.schema(),
        },
    ]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        messages=messages,
        functions=functions,
        function_call="auto",
    )
    response_message = response["choices"][0]["message"]
    arguments = response_message["function_call"]["arguments"]
    return Recipe(**json.loads(arguments))

def get_page_content():
    url = "https://christieathome.com/blog/kl-hokkien-mee/"
    response = requests.get(url)
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    # Find the first element with the class "content"
    content_element = soup.find(class_="content")
    return content_element.get_text(strip=True, separator=" ")

def main():
    content = get_page_content()
    print(content)
    print(run_conversation(content))


if __name__ == "__main__":
    main()
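A nice side effect of building the result with Recipe(**json.loads(arguments)) is that pydantic validates the model's output for us. As a sketch (not part of the script above), the call site could catch both failure modes explicitly:

import json

from pydantic import ValidationError

try:
    recipe = run_conversation(content)
except json.JSONDecodeError:
    # The model's "arguments" string wasn't valid JSON
    print("model output was not valid JSON")
except ValidationError as e:
    # Valid JSON, but it doesn't conform to the Recipe schema
    print(f"model output did not match the schema: {e}")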

Here is where I started to run into problems. When running the script, I get the following error:

json.decoder.JSONDecodeError: Expecting ',' delimiter: line 7 column 56 (char 348)

Inspecting the JSON output of the script, we see the model isn’t returning valid JSON:

{
    "title": "...",
    "description": "...",
    "ingredients": [
        ...
        {"name": "large shrimp", "unit": "cup", "quantity": 3/4},
        ...
    ]
}

We see "quantity": 3/4 isn’t valid JSON. We can try to steer the model adding a description to the pydantic field:

from pydantic import BaseModel, Field

class Ingredient(BaseModel):
    name: str
    quantity: float = Field(
        description="float value, must be a valid JSON type. for example: 0.75, never 3/4"
    )
    unit: str

This modifies the JSON schema in the following way:

{
    "title": "Recipe",
    "type": "object",
    "properties": {
        ...
    },
    "required": [
        ...
    ],
    "definitions": {
        "Ingredient": {
            "title": "Ingredient",
            "type": "object",
            "properties": {
                ...
                "quantity": {
                    "title": "Quantity",
                    "description": "float value, must be a valid JSON type. for example: 0.75, never 3/4",
                    "type": "number"
                },
                ...
            },
            "required": [
                ...
            ]
        }
    }
}

Unfortunately, this doesn’t resolve the invalid JSON issue. However, switching from gpt-3.5-turbo-16k to gpt-4-0613 (and removing the Field description) yields JSON that adheres to the input schema. Still, GPT-4 models are slower and more expensive than 3.5 models, so there is motivation to try and get this working with the latter.

Taking an approach I've tried previously, including the schema directly in the prompt and asking the model to respond with JSON, we can get more reliable results with gpt-3.5-turbo-16k.

import openai
import json
import requests

from bs4 import BeautifulSoup
from pydantic import BaseModel
from typing import List

class Ingredient(BaseModel):
    name: str
    quantity: float
    unit: str

class Recipe(BaseModel):
    title: str
    description: str
    ingredients: List[Ingredient]
    steps: List[str]

def run_conversation(content):
    prompt = f"""
Extract input content as JSON adhering to the following schemas

class Ingredient(BaseModel):
    name: str
    quantity: float
    unit: str

# extract this schema
class Recipe(BaseModel):
    title: str
    description: str
    ingredients: List[Ingredient]
    steps: List[str]

{content}

Respond with only JSON.
"""
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        messages=messages,
    )
    response_message = response["choices"][0]["message"]
    return Recipe(**json.loads(response_message.content))

def get_page_content():
    url = "https://christieathome.com/blog/kl-hokkien-mee/"
    response = requests.get(url)
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    # Find the first element with the class "content"
    content_element = soup.find(class_="content")
    return content_element.get_text(strip=True, separator=" ")

def main():
    content = get_page_content()
    print(content)
    print(run_conversation(content))


if __name__ == "__main__":
    main()
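One more defensive measure worth considering with this approach: models sometimes wrap their reply in markdown code fences despite the "Respond with only JSON" instruction. A sketch of a small helper (my own addition, not something from the API) that strips fences before parsing:

def extract_json(text: str) -> str:
    # Strip markdown code fences like ```json ... ``` if the model adds them
    text = text.strip()
    if text.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closing fence
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    return text.strip()

With this, the last line of run_conversation becomes Recipe(**json.loads(extract_json(response_message.content))).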

Takeaways#

On one hand, it's great to see OpenAI training models to better integrate with emerging language model use cases like function invocation and schema extraction. On the other hand, OpenAI acknowledges in their documentation that this approach doesn't always work:

the model may generate invalid JSON or hallucinate parameters

Previous techniques I’ve explored for schema extraction seem to produce more consistent results, even with less advanced models.