I’ve been setting up the foundations to add node summaries to Delta. Ideally, I will use the same model to create the node summaries as I use to generate the responses, since this keeps the model dependencies minimal. However, my early experiments have shown some inconsistency in how a shared prompt behaves across models. To understand this and smooth it out as much as possible, I plan to set up evals, sketched below, to ensure the summaries:
- stay within a certain length
- don’t include direct references to the user or assistant
- are a single sentence or fragment
- include the relevant topic(s) discussed
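A minimal sketch of what these checks might look like as plain Python functions. The character limit, the banned-word pattern, and the naive topic matching are all assumptions I’d expect to tune once I see real summaries:

```python
import re

MAX_CHARS = 120  # assumed length budget; would be tuned against real summaries


def check_length(summary: str) -> bool:
    """Summary stays within the character budget."""
    return len(summary) <= MAX_CHARS


def check_no_participant_refs(summary: str) -> bool:
    """Summary avoids direct references to the user or assistant."""
    return not re.search(r"\b(user|assistant)\b", summary, re.IGNORECASE)


def check_single_sentence(summary: str) -> bool:
    """Summary is a single sentence or fragment (no internal sentence breaks)."""
    # Drop one optional trailing terminator, then look for any remaining ones.
    body = summary.strip().rstrip(".!?")
    return not re.search(r"[.!?]", body)


def check_topics(summary: str, expected_topics: list[str]) -> bool:
    """Summary mentions every expected topic (naive substring match)."""
    lowered = summary.lower()
    return all(topic.lower() in lowered for topic in expected_topics)
```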
I am currently looking for a straightforward, local option for running evals. I’d like to avoid cloud products, and testing across multiple models needs to be easy. My ideal output would be a matrix of the models run against the results of each eval check on that model’s output, given the same system and user prompts.
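I haven’t picked a tool yet, but here is a rough sketch of the shape I’m after, building on the check functions above. The model names are illustrative, and `run_model` is a placeholder for whatever local inference call each model ends up using (an Ollama or llama.cpp wrapper, for example); the topic check needs per-node expected topics, so it’s left out of this sketch:

```python
# Rows are models, columns are eval checks, all run against the same prompts.
SYSTEM_PROMPT = "Summarize the conversation node in one short fragment."
USER_PROMPT = "..."  # the conversation node under test

MODELS = ["model-a", "model-b", "model-c"]  # hypothetical model identifiers

CHECKS = {
    "length": check_length,
    "no_refs": check_no_participant_refs,
    "one_sentence": check_single_sentence,
}


def run_model(model: str, system: str, user: str) -> str:
    """Placeholder: swap in the actual local inference call per model."""
    return f"stand-in summary from {model}"


def build_matrix() -> list[dict]:
    rows = []
    for model in MODELS:
        summary = run_model(model, SYSTEM_PROMPT, USER_PROMPT)
        rows.append(
            {"model": model, **{name: fn(summary) for name, fn in CHECKS.items()}}
        )
    return rows


if __name__ == "__main__":
    for row in build_matrix():
        print(row)
```

Each printed row is one model’s pass/fail results across the checks, which is roughly the matrix I want out of whatever tool I land on.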