diff --git a/index.html b/index.html
index bda3a38..9889a02 100644
--- a/index.html
+++ b/index.html
@@ -1555,26 +1555,25 @@

Abstract

  Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5
- Pro
- are powering various image-text applications and scoring high on many vision-understanding benchmarks. We
- propose
- BlindTest, a suite of 7 visual tasks very
- easy to
- humans such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which
- letter is being
- circled in a word; and (d) counting the number of circles in an Olympic-like logo. Surprisingly, four
- state-of-the-art
- VLMs are, on average, only 58.12% accurate on our benchmark where the human expected accuracy is 100%.
+ Pro,
+ are powering various image-text applications and scoring high on many vision-understanding benchmarks, we
+ find that
+ they are surprisingly still struggling with low-level vision tasks that are easy to humans.
+ Specifically, on BlindTest, our suite of
+ 7 very
+ simple tasks such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c)
+ which letter is
+ being circled in a word; and (d) counting circles in an Olympic-like logo, four state-of-the-art VLMs are
+ only 58.57%
+ accurate on average.
  Sonnet-3.5 performs the best at 74.01%
- accuracy. On BlindTest, VLMs struggle
- with tasks that require precise
- spatial information and simple counting (from 0 to 2, 2 to 5, or 5 to 10), sometimes providing an
- impression of a
- person with myopia seeing fine details
- as blurry
- and making educated guesses.
+ accuracy, but
+ this is still far from the human expected accuracy of 100%.
+ Across different image resolutions and line widths, VLMs consistently struggle with tasks that require
+ precise spatial
+ information and recognizing geometric primitives that overlap or are close together.

@@ -1627,7 +1626,7 @@

Overview of All Tasks

Path Following
- Mean
+ Mean
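As a rough illustration of the kind of low-level check behind task (a) in the abstract above, here is a minimal sketch, not the authors' actual stimulus-generation code, that draws two circles with matplotlib and labels whether they overlap using the standard distance-versus-radii test; the function names, canvas size, and parameters are all illustrative assumptions.

```python
# Illustrative sketch only: render a two-circle image and compute the
# ground-truth overlap label with the distance-vs-radii test.
import math
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

def circles_overlap(c1, r1, c2, r2):
    """Two circles overlap iff the distance between their centers
    is at most the sum of their radii."""
    dist = math.hypot(c1[0] - c2[0], c1[1] - c2[1])
    return dist <= r1 + r2

def draw_pair(c1, r1, c2, r2, path="pair.png"):
    """Save an image of the two circles and return the overlap label."""
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.add_patch(Circle(c1, r1, fill=False, linewidth=2))
    ax.add_patch(Circle(c2, r2, fill=False, linewidth=2))
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 10)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return circles_overlap(c1, r1, c2, r2)

# Example: centers are 3 units apart and the radii sum to 3.5,
# so the ground-truth label is True (overlapping).
print(draw_pair((3, 5), 2.0, (6, 5), 1.5))
```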