I’ve been digging more into evals. I wrote a simple Claude completion function in openai/evals to better understand how the different pieces fit together. Quick and dirty code:

from anthropic import NOT_GIVEN, Anthropic

from evals.api import CompletionFn, CompletionResult
from evals.prompt.base import is_chat_prompt


class ClaudeChatCompletionResult(CompletionResult):
    def __init__(self, response) -> None:
        self.response = response

    def get_completions(self) -> list[str]:
        return [self.response.strip()]


class ClaudeChatCompletionFn(CompletionFn):
    def __init__(self, **kwargs) -> None:
        self.client = Anthropic()

    def __call__(self, prompt, **kwargs) -> ClaudeChatCompletionResult:
        # Claude takes the system prompt as a separate parameter, so pull it
        # out of the chat messages if one is present
        system_prompt = None
        if is_chat_prompt(prompt):
            messages = prompt
            system_prompt = next((p for p in messages if p.get("role") == "system"), None)
            if system_prompt:
                messages.remove(system_prompt)
        else:
            # I think there is a util function to do this already
            messages = [{
                "role": "user",
                "content": prompt,
            }]
        message = self.client.messages.create(
            max_tokens=1024,
            system=system_prompt["content"] if system_prompt else NOT_GIVEN,
            messages=messages,
            model="claude-3-opus-20240229",
        )
        return ClaudeChatCompletionResult(message.content[0].text)
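
Just to sanity-check the function outside the eval harness (assuming ANTHROPIC_API_KEY is set), it can be called directly:

fn = ClaudeChatCompletionFn()
result = fn("Reply with the single word: pong")
print(result.get_completions())  # e.g. ['pong']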

I registered it as a completion function:

claude/claude-3-opus:
  class: evals.completion_fns.claude:ClaudeChatCompletionFn
  args:
    completion_fn: claude-3-opus

I ran it with

oaieval claude/claude-3-opus extraction

against a toy extraction eval I wrote. The samples:

{"input": [{"role": "system", "content": "You are responsible for extracting structured data from the provided unstructured data. Follow the user's instructions and output JSON only without code fences."}, {"role": "user", "content": "CONTENT: I live at 42 Wallaby Way, Sydney\nINSTRUCTIONS: extract street and city"}], "ideal": "{\"street\": \"42 Wallaby Way\",\"city\": \"Sydney\"}"}
{"input": [{"role": "system", "content": "You are responsible for extracting structured data from the provided unstructured data. Follow the user's instructions and output JSON only without code fences."}, {"role": "user","content": "CONTENT: My favorite color is blue and I was born on June 15, 1985.\nINSTRUCTIONS: extract favorite color and date of birth. format date of birth as yyyy-mm-dd"}], "ideal": "{\"favorite_color\": \"blue\",\"date_of_birth\": \"1985-06-15\"}"}

And the eval registration, using JsonMatch to grade the output:

extraction:
  id: extraction.test.v0
  metrics: [accuracy]

extraction.test.v0:
  class: evals.elsuite.basic.json_match:JsonMatch
  args:
    samples_jsonl: extraction/samples.jsonl
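
JsonMatch grades by comparing the sampled output to the ideal as parsed JSON rather than as raw strings, roughly this idea (a simplified sketch, not the actual implementation):

import json


def rough_json_match(sampled: str, ideal: str) -> bool:
    # Parse both sides and compare the resulting objects, so key order and
    # whitespace differences in the model output don't affect the score.
    try:
        return json.loads(sampled) == json.loads(ideal)
    except json.JSONDecodeError:
        return False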

It seems this project is moving away from the “Completion Functions” abstraction to “Solvers”.

[W]e’ve found that passing a prompt to the CompletionFn encourages eval designers to write prompts that often privileges a particular kind of Solver over others. e.g. If developing with ChatCompletion models, the eval tends to bake-in prompts that work best for ChatCompletion models. In moving from Completion Functions to Solvers, we are making a deliberate choice to write Solver-agnostic evals, and delegating any model-specific or strategy-specific code to the Solver.
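
If I'm reading the direction right, the Solver owns the prompting strategy end to end. A rough conceptual sketch of that inversion of control (the names below are mine, not the repo's actual Solver API):

from dataclasses import dataclass

from anthropic import Anthropic


@dataclass
class TaskState:
    task_description: str  # what the eval wants done, solver-agnostic
    messages: list[dict]   # the conversation so far, as role/content dicts


class ClaudeSolver:
    def __init__(self, model: str = "claude-3-opus-20240229") -> None:
        self.client = Anthropic()
        self.model = model

    def solve(self, task_state: TaskState) -> str:
        # The solver decides how to turn the task into a Claude-shaped request:
        # where the task description goes, message formatting, sampling params.
        message = self.client.messages.create(
            max_tokens=1024,
            system=task_state.task_description,
            messages=task_state.messages,
            model=self.model,
        )
        return message.content[0].text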

In working through this exercise, a thought that kept coming to mind is how many different approaches we currently have for model prompting, both because of model differences (completion vs. chat) and because of API design decisions. To allow for easy switching between models, a gateway/adapter pattern that maps from the model/provider API to your application's internal API is as critical as ever, and it gets more involved if your application relies on streaming responses. Abstractions that decouple you from provider APIs keep you flexible enough to adopt future model advances and keep your switching costs low.
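
As a rough illustration of the adapter idea (the class names and the complete() signature are made up for this post; the provider calls are the Anthropic messages API used above and the OpenAI chat completions API):

from anthropic import Anthropic
from openai import OpenAI


class ClaudeAdapter:
    """Maps the app's internal complete(system, messages) call onto Anthropic's API."""

    def __init__(self, model: str = "claude-3-opus-20240229") -> None:
        self.client = Anthropic()
        self.model = model

    def complete(self, system: str, messages: list[dict]) -> str:
        response = self.client.messages.create(
            max_tokens=1024,
            system=system,
            messages=messages,
            model=self.model,
        )
        return response.content[0].text


class OpenAIAdapter:
    """Same internal call, mapped onto OpenAI's chat completions API."""

    def __init__(self, model: str = "gpt-4-turbo") -> None:
        self.client = OpenAI()
        self.model = model

    def complete(self, system: str, messages: list[dict]) -> str:
        # OpenAI takes the system prompt inline as the first message
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "system", "content": system}, *messages],
        )
        return response.choices[0].message.content

The rest of the application only ever calls complete(system, messages), so swapping providers, or adding a new one, stays a local change behind the adapter.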