I finally found some time to run a more comprehensive eval of Connections, with the model making one guess at a time and Python code validating each guess and giving feedback.
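As a rough illustration of the "one guess at a time, validated in Python" setup, here is a minimal sketch. It is not the code used for these evals: the `ConnectionsGame` class, its field names, and the feedback strings are all hypothetical, and the four-mistake limit and "one away" hint are assumptions modeled on how the NYT game behaves.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionsGame:
    """One puzzle: 16 words in four categories of four (hypothetical structure)."""

    solution: dict[str, set[str]]  # category name -> its four words
    max_mistakes: int = 4          # assumed limit, as in the NYT game
    mistakes: int = 0
    solved: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Normalize so comparisons ignore case and stray whitespace.
        self.solution = {
            cat: {w.strip().upper() for w in words}
            for cat, words in self.solution.items()
        }

    def remaining_words(self) -> set[str]:
        # Words belonging to categories that have not been solved yet.
        return {
            w
            for cat, words in self.solution.items()
            if cat not in self.solved
            for w in words
        }

    def is_over(self) -> bool:
        # The game ends when mistakes run out or every category is solved.
        return (
            self.mistakes >= self.max_mistakes
            or len(self.solved) == len(self.solution)
        )

    def check_guess(self, guess: list[str]) -> str:
        """Validate one guess and return a feedback string for the model."""
        words = {w.strip().upper() for w in guess}
        # Malformed guesses are rejected without counting as a mistake
        # (an assumption of this sketch).
        if len(words) != 4:
            return "invalid: a guess must contain four distinct words"
        if not words <= self.remaining_words():
            return "invalid: guess uses words that are not on the board"

        best_overlap = 0
        for cat, members in self.solution.items():
            if cat in self.solved:
                continue
            if words == members:
                self.solved.append(cat)
                return f"correct: {cat}"
            best_overlap = max(best_overlap, len(words & members))

        self.mistakes += 1
        if best_overlap == 3:
            return "incorrect: one away"
        return "incorrect"
```

In a loop, the feedback string from `check_guess` would be appended to the conversation before asking the model for its next guess, until `is_over()` returns true.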
I ran about 100 puzzles with gpt-4o-mini, gpt-4o, and claude-3-5-sonnet, but it became clear that Sonnet was going to perform the best, so I decided to complete the full set of 466 puzzles released as of today with Sonnet only.
This wasn’t cheap, but it was interesting to see the results.
I’m going to write up some more comprehensive findings and push the code soon.