While I didn’t have much success getting gpt-4o to perform Task 1 (counting line intersections) from the Vision Language Models Are Blind paper, I pulled down some code and did a bit of testing with Claude 3.5 Sonnet. The paper reports the following success rates for Sonnet on this line intersection task:

| Thickness | Sonnet 3.5 |
|-----------|------------|
| 2         | 80.00      |
| 3         | 79.00      |
| 4         | 73.00      |
| Average   | 77.33      |

I used the code from the paper to generate 30 similar images of intersecting (or non-intersecting) lines with a line thickness of 4. I chose thickness 4 because it was the worst-performing thickness according to the paper. After classifying these manually (I didn’t realize the configurations.json file already contained the ground truths), I ran the prompt from the paper against these images using Sonnet:

How many times do the blue and red lines intersect?
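With that prompt, each request is just one image plus the question. Here is a rough sketch of the kind of call involved, using the Anthropic Python SDK; the model snapshot and the `ask_sonnet` helper are illustrative, not the exact harness from the paper’s repo:

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = "How many times do the blue and red lines intersect?"

def ask_sonnet(image_path: str) -> str:
    """Send one generated image plus the paper's prompt to Claude 3.5 Sonnet."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumption: the June 2024 Sonnet snapshot
        max_tokens=256,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": image_b64,
                        },
                    },
                    {"type": "text", "text": PROMPT},
                ],
            }
        ],
    )
    # Return the raw text response; the intersection count still has to be parsed out of it.
    return message.content[0].text
```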

My results were much better than those reported in the paper.

Results summary:
Expected count 0: 9/9 correct (Accuracy: 100.00%)
Expected count 1: 10/11 correct (Accuracy: 90.91%)
Failed images for count 1:
  - ./images/1/image_14_thickness_4.png
Expected count 2: 10/10 correct (Accuracy: 100.00%)
Overall accuracy: 96.67%
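
The summary above is just a per-count tally. A minimal sketch of how such a summary can be produced, assuming a list of `(expected, predicted, image_path)` tuples rather than my actual script:

```python
from collections import defaultdict

def summarize(results):
    """Print per-count and overall accuracy from (expected, predicted, image_path) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    failures = defaultdict(list)

    for expected, predicted, path in results:
        total[expected] += 1
        if predicted == expected:
            correct[expected] += 1
        else:
            failures[expected].append(path)

    for count in sorted(total):
        acc = 100.0 * correct[count] / total[count]
        print(f"Expected count {count}: {correct[count]}/{total[count]} correct (Accuracy: {acc:.2f}%)")
        if failures[count]:
            print(f"Failed images for count {count}:")
            for path in failures[count]:
                print(f"  - {path}")

    overall = 100.0 * sum(correct.values()) / sum(total.values())
    print(f"Overall accuracy: {overall:.2f}%")
```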

Sonnet made only a single error out of 30 images. I need to dig deeper into the paper to make sure my methodology matches the authors’, but I was surprised to see such a divergence given that the implementation seemed straightforward. Hopefully, I can find a bit more time to look into these discrepancies.