I’ve been digging more into evals.
I wrote a simple Claude completion function in openai/evals
to better understand how the different pieces fit together.
Quick and dirty code:
```python
from anthropic import NOT_GIVEN, Anthropic
from evals.api import CompletionFn, CompletionResult
from evals.prompt.base import is_chat_prompt


class ClaudeChatCompletionResult(CompletionResult):
    def __init__(self, response) -> None:
        self.response = response

    def get_completions(self) -> list[str]:
        return [self.response.strip()]


class ClaudeChatCompletionFn(CompletionFn):
    def __init__(self, **kwargs) -> None:
        self.client = Anthropic()

    def __call__(self, prompt, **kwargs) -> ClaudeChatCompletionResult:
        system_prompt = None  # stays None for non-chat prompts
        if is_chat_prompt(prompt):
            messages = prompt
            # Claude takes the system prompt as a separate parameter,
            # so pull it out of the message list.
            system_prompt = next((p for p in messages if p.get("role") == "system"), None)
            if system_prompt:
                messages.remove(system_prompt)
        else:
            # I think there is a util function to do this already
            messages = [{
                "role": "user",
                "content": prompt,
            }]
        message = self.client.messages.create(
            max_tokens=1024,
            system=system_prompt["content"] if system_prompt else NOT_GIVEN,
            messages=messages,
            model="claude-3-opus-20240229",
        )
        return ClaudeChatCompletionResult(message.content[0].text)
```
This gets registered as a completion function (in openai/evals these live as YAML files under evals/registry/completion_fns/):

```yaml
claude/claude-3-opus:
  class: evals.completion_fns.claude:ClaudeChatCompletionFn
  args:
    completion_fn: claude-3-opus
```
Run with

```sh
oaieval claude/claude-3-opus extraction
```

on a toy eval I wrote. The samples, in extraction/samples.jsonl:
{"input": [{"role": "system", "content": "You are responsible for extracting structured data from the provided unstructured data. Follow the user's instructions and output JSON only without code fences."}, {"role": "user", "content": "CONTENT: I live at 42 Wallaby Way, Sydney\nINSTRUCTIONS: extract street and city"}], "ideal": "{\"street\": \"42 Wallaby Way\",\"city\": \"Sydney\"}"}
{"input": [{"role": "system", "content": "You are responsible for extracting structured data from the provided unstructured data. Follow the user's instructions and output JSON only without code fences."}, {"role": "user","content": "CONTENT: My favorite color is blue and I was born on June 15, 1985.\nINSTRUCTIONS: extract favorite color and date of birth. format date of birth as yyyy-mm-dd"}], "ideal": "{\"favorite_color\": \"blue\",\"date_of_birth\": \"1985-06-15\"}"}
And the eval itself, which uses the built-in JsonMatch class to compare the model's output against the ideal JSON:

```yaml
extraction:
  id: extraction.test.v0
  metrics: [accuracy]

extraction.test.v0:
  class: evals.elsuite.basic.json_match:JsonMatch
  args:
    samples_jsonl: extraction/samples.jsonl
```
It seems the project is moving away from the “Completion Functions” abstraction toward “Solvers”:
> [W]e’ve found that passing a prompt to the CompletionFn encourages eval designers to write prompts that often privileges a particular kind of Solver over others. e.g. If developing with ChatCompletion models, the eval tends to bake-in prompts that work best for ChatCompletion models. In moving from Completion Functions to Solvers, we are making a deliberate choice to write Solver-agnostic evals, and delegating any model-specific or strategy-specific code to the Solver.
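To make the Solver idea concrete, here is roughly the shape of the abstraction as I understand it from reading the repo. This is a paraphrase rather than the repo’s actual code, so treat the names and signatures as approximate: the eval hands the Solver a TaskState, and the Solver owns all of the prompt-building and model-calling strategy.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any

from anthropic import Anthropic


# Approximate shapes, paraphrased from the Solvers abstraction -- not exact repo code.
@dataclass
class Message:
    role: str
    content: str


@dataclass
class TaskState:
    task_description: str  # what the eval wants done, expressed model-agnostically
    messages: list[Message] = field(default_factory=list)  # conversation so far
    current_state: Any = None  # eval-specific extra state


@dataclass
class SolverResult:
    output: str


class Solver(ABC):
    @abstractmethod
    def _solve(self, task_state: TaskState) -> SolverResult:
        """Turn the task state into model input, call the model, return its output."""


class ClaudeSolver(Solver):
    """Hypothetical Solver: the prompting strategy lives here, not in the eval."""

    def __init__(self, model: str = "claude-3-opus-20240229") -> None:
        self.client = Anthropic()
        self.model = model

    def _solve(self, task_state: TaskState) -> SolverResult:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            # This Solver chooses to send the task description as the system prompt;
            # a different Solver could prompt the same TaskState completely differently.
            system=task_state.task_description,
            messages=[{"role": m.role, "content": m.content} for m in task_state.messages],
        )
        return SolverResult(response.content[0].text)
```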
In working through this exercise, a recurring thought was how many different approaches we currently have for model prompting, both because of model differences (completion vs. chat) and because of API design decisions. To allow easy switching between models, a gateway/adapter pattern that maps from the model/provider API to your application’s internal API will be as critical as ever, and streaming responses complicate it further. Abstractions that decouple you from provider APIs keep you flexible to adopt future advances in models and keep your switching cost low.
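As a sketch of what I mean (the ChatModel protocol and both adapter classes are names I made up for illustration, and I’m ignoring streaming, tool use, retries, and error handling):

```python
from typing import Protocol

from anthropic import NOT_GIVEN, Anthropic
from openai import OpenAI


class ChatModel(Protocol):
    """Hypothetical internal interface: messages in, completion text out."""

    def complete(self, messages: list[dict], system: str | None = None) -> str: ...


class AnthropicChatModel:
    """Adapter mapping the internal interface onto the Anthropic Messages API."""

    def __init__(self, model: str = "claude-3-opus-20240229") -> None:
        self.client = Anthropic()
        self.model = model

    def complete(self, messages: list[dict], system: str | None = None) -> str:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            # Anthropic takes the system prompt as a separate parameter.
            system=system if system is not None else NOT_GIVEN,
            messages=messages,
        )
        return response.content[0].text


class OpenAIChatModel:
    """Adapter mapping the internal interface onto the OpenAI Chat Completions API."""

    def __init__(self, model: str = "gpt-4") -> None:
        self.client = OpenAI()
        self.model = model

    def complete(self, messages: list[dict], system: str | None = None) -> str:
        # OpenAI takes the system prompt as just another message in the list.
        if system is not None:
            messages = [{"role": "system", "content": system}, *messages]
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
        )
        return response.choices[0].message.content
```

Application code depends only on ChatModel, so swapping providers becomes a one-line change where the model is constructed rather than a rewrite of every call site.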