I’ve been setting up the foundations to add node summaries to Delta. Ideally, I will use the same model to create the node summaries as I use to generate the responses, since this keeps the model dependencies minimal. However, my early experiments have shown some inconsistency in how a shared prompt behaves across models. To try to understand this and smooth it out as much as possible, I plan to set up evals (sketched after the list below) to ensure the summaries:

  • stay under a certain length
  • don’t include direct references to the user or assistant
  • are a single sentence or fragment
  • include the relevant topic(s) discussed
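
As a rough illustration, here is a minimal sketch of what those checks could look like as plain Python functions. The character limit, the banned-word pattern, and the keyword-based topic check are all assumptions on my part, not the final criteria.

```python
import re

# Assumed threshold for illustration; the real limit is still undecided.
MAX_CHARS = 120

def within_length(summary: str, max_chars: int = MAX_CHARS) -> bool:
    """Summary stays under an assumed character budget."""
    return len(summary.strip()) <= max_chars

def no_direct_references(summary: str) -> bool:
    """Summary does not mention the user or assistant directly."""
    return re.search(r"\b(user|assistant)\b", summary, re.IGNORECASE) is None

def single_sentence_or_fragment(summary: str) -> bool:
    """At most one terminal punctuation mark: one sentence, or a fragment."""
    return len(re.findall(r"[.!?]", summary.strip())) <= 1

def mentions_topics(summary: str, expected_topics: list[str]) -> bool:
    """Naive keyword check: at least one expected topic appears in the summary."""
    lowered = summary.lower()
    return any(topic.lower() in lowered for topic in expected_topics)

# Checks that only need the summary text; the topic check takes extra input
# and is handled separately in the harness below.
CHECKS = {
    "length": within_length,
    "no_refs": no_direct_references,
    "one_sentence": single_sentence_or_fragment,
}
```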

I am currently looking for a straightforward, local option for running evals. I’d like to avoid cloud products, and testing across multiple models needs to be easy. My ideal output would be a matrix: one row per model, one column per eval check, showing how each model’s output fares on each check given the same system and user prompts.
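
To make the matrix concrete, here is a sketch of the harness I have in mind, reusing the check functions above. `generate_summary` is a hypothetical stand-in for however each model actually gets invoked locally, and the model names and prompts are dummy values.

```python
def generate_summary(model: str, system_prompt: str, user_prompt: str) -> str:
    """Hypothetical hook: swap in the real call to each local model here."""
    # Canned output so the harness runs end to end while I experiment.
    return "Discussion of setting up local evals for node summaries."

MODELS = ["model-a", "model-b", "model-c"]  # placeholder model names

def run_matrix(system_prompt: str, user_prompt: str, expected_topics: list[str]) -> None:
    """Print one row per model with pass/fail for each eval check."""
    print("\t".join(["model", *CHECKS, "topics"]))
    for model in MODELS:
        summary = generate_summary(model, system_prompt, user_prompt)
        row = [model]
        row += ["pass" if check(summary) else "fail" for check in CHECKS.values()]
        row.append("pass" if mentions_topics(summary, expected_topics) else "fail")
        print("\t".join(row))

run_matrix("You summarise conversation nodes.", "Summarise this exchange...", ["evals"])
```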