update
taesiri committed Nov 1, 2024
1 parent aad5b06 commit 24e269a
Showing 1 changed file with 187 additions and 8 deletions.
195 changes: 187 additions & 8 deletions index.html
@@ -681,12 +681,6 @@
margin-right: 5px;
}

figcaption {
text-align: center;
margin-top: 10px;
color: #000;
}


.result-table {
width: 100%;
@@ -1498,7 +1492,11 @@ <h1 class="title is-1 publication-title">
<span class="author-block"><sup>2</sup>University of Alberta,</span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block">17th Asian Conference on Computer Vision (ACCV 2024)</span>
<span class="author-block">17th Asian Conference on Computer Vision (ACCV 2024)
</div>
<div>
<span class="author-block"><strong>Accepted for Oral Presentation</strong>
</span>
</div>
</span>

@@ -1720,6 +1718,177 @@ <h2 class="title is-3 has-text-centered">Overview of All Tasks</h2>



<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">

<p>
While large language models with vision capabilities (VLMs), e.g., <span class="model">GPT-<span
class="gpt-green">4o</span></span> and <span class="model">Gemini-<span class="gemini-blue">1.5</span>
Pro</span>,
are powering various image-text applications and scoring high on many vision-understanding benchmarks, we
find that
they still struggle, surprisingly, with low-level vision tasks that are easy for humans.
Specifically, on <span class="model">Blind<span class="blindtest-purple">Test</span></span>, our suite of
7 very
simple tasks such as determining (a) whether two circles overlap; (b) whether two lines intersect; (c)
which letter is
circled in a word; and (d) how many circles are in an Olympic-like logo, four state-of-the-art VLMs are
only 58.57%
accurate on average.
<span class="model">Sonnet-<span class="sonnet35-brown">3.5</span></span> performs the best at 74.94%
accuracy, but
this is still far from the expected human accuracy of 100%.
Across different image resolutions and line widths, VLMs consistently struggle with tasks that require
precise spatial
information and the recognition of geometric primitives that overlap or are close together.
</p>

</div>
</div>
</div>
</div>
</section>


<hr>
<section class="section">
<div class="container">
<div class="content-wrapper">
<h2 class="title is-3 has-text-centered">Overview of All Tasks</h2>
<div class="table-container">
<table class="performance-table results-table">
<thead>
<tr>
<th>Model</th>
<th>
<div class="task-icon-container"><img src="static/images/logo/two_lines.svg" alt="Line Intersect"
class="task-icon-small"></div>
</th>
<th>
<div class="task-icon-container"><img src="static/images/logo/two-colored-circles-svg.svg"
alt="Two Circles" class="task-icon-small"></div>
</th>
<th>
<div class="task-icon-container"><img src="static/images/logo/acknowledgement-svg.svg"
alt="Circled Letter" class="task-icon-small"></div>
</th>
<th>
<div class="task-icon-container"><img src="static/images/logo/three-circles-svg.svg"
alt="Olympic Rings" class="task-icon-small"></div>
</th>
<th>
<div class="task-icon-container"><img src="static/images/logo/overlapping-pentagons-svg.svg"
alt="Pentagon" class="task-icon-small"></div>
</th>
<th>
<div class="task-icon-container"><img src="static/images/logo/nested-squares-svg.svg"
alt="Nested Squares" class="task-icon-small"></div>
</th>
<th>
<div class="task-icon-container"><img src="static/images/logo/grid-3x4-svg.svg" alt="Grid"
class="task-icon-small">
</div>
</th>
<th>
<div class="task-icon-container"><img src="static/images/logo/subway-map-svg.svg" alt="Path Following"
class="task-icon-small"></div>
</th>
<th style="font-weight: bold; text-align: center;">Mean</th>
</tr>
</thead>
<tbody>
<tr>
<td>Random</td>
<td class="rnumber">33.33</td>
<td class="rnumber">50.00</td>
<td class="rnumber">5.77</td>
<td class="rnumber">20.00</td>
<td class="rnumber">20.00</td>
<td class="rnumber">25.00</td>
<td class="rnumber">4.55</td>
<td class="rnumber">33.33</td>
<td class="rnumber">24.00</td>
</tr>
<tr>
<td><img src="static/images/chatgpt-icon.svg" alt="GPT-4o" class="model-logo-small"></td>
<td class="rnumber">41.61</td>
<td class="rnumber">72.67</td>
<td class="rnumber">70.18</td>
<td class="rnumber">42.50</td>
<td class="rnumber">17.50</td>
<td class="rnumber">55.83</td>
<td class="rnumber">39.58</td>
<td class="rnumber">47.89</td>
<td class="rnumber">48.47</td>
</tr>
<tr>
<td><img src="static/images/google-gemini-icon.svg" alt="Gemini-1.5" class="model-logo-small"></td>
<td class="rnumber">66.94</td>
<td class="highlight rnumber">92.78</td>
<td class="highlight rnumber">92.81</td>
<td class="highlight rnumber">87.08</td>
<td class="rnumber">19.37</td>
<td class="rnumber">80.00</td>
<td class="rnumber">39.39</td>
<td class="rnumber">41.60</td>
<td class="rnumber">65.00</td>
</tr>
<tr>
<td><img src="static/images/claude-ai-icon.svg" alt="Sonnet-3" class="model-logo-small"></td>
<td class="rnumber">43.41</td>
<td class="rnumber">84.52</td>
<td class="rnumber">73.34</td>
<td class="rnumber">31.66</td>
<td class="rnumber">9.79</td>
<td class="rnumber">65.00</td>
<td class="rnumber">36.17</td>
<td class="rnumber">23.24</td>
<td class="rnumber">45.89</td>
</tr>
<tr>
<td><img src="static/images/claude-35.png" alt="Sonnet-3.5" class="model-logo-small"></td>
<td class="highlight rnumber">75.36</td>
<td class="rnumber">91.66</td>
<td class="rnumber">89.22</td>
<td class="rnumber">44.16</td>
<td class="highlight rnumber">77.29</td>
<td class="highlight rnumber">92.08</td>
<td class="highlight rnumber">74.26</td>
<td class="highlight rnumber">55.53</td>
<td class="highlight rnumber">74.94</td>
</tr>
<tr>
<td><strong>Mean</strong></td>
<td class="rnumber">56.84</td>
<td class="rnumber">85.41</td>
<td class="rnumber">81.39</td>
<td class="rnumber">51.35</td>
<td class="rnumber">30.99</td>
<td class="rnumber">73.29</td>
<td class="rnumber">47.35</td>
<td class="rnumber">42.06</td>
<td class="highlight rnumber">58.57</td>
</tr>
</tbody>
</table>
</div>
<figcaption>Accuracy (%) of each model over 7 tasks. The mean accuracy over all four models is 58.57%,
substantially
better than random chance (24%), which is computed by treating each task as a single-label, N-way
classification
problem. <span class="model">Sonnet-<span class="sonnet35-brown">3.5</span></span> is the best (74.94%
accuracy)
but is still far from the 100% expected accuracy.</figcaption>
</div>
</section>



<section class="section">
<div class="container">
<div class="task-grid">
@@ -1769,6 +1938,7 @@ <h2 class="title is-3 has-text-centered">Overview of All Tasks</h2>




<!-- TASK 1 Begins -->
<section id="task1" class="section">
<div class="container is-max-desktop">
@@ -1996,6 +2166,9 @@ <h2>How many times do the blue and red lines intersect?</h2>
<td style="text-align: center; vertical-align: middle;"><span class="label">1</span><span
class="cross"></span>
</td>
<td style="text-align: center; vertical-align: middle;"><span class="label">1</span><span
class="cross"></span>
</td>
</tr>
<tr>
<td style="text-align: center; vertical-align: middle;"><img src="static/images/claude-35.png"
@@ -2903,7 +3076,7 @@ <h2>How many circles are in the image? Answer with only the number in numerical
class="cross"></span>
</td>
<td style="text-align: center; vertical-align: middle;"><span class="label">10</span><span
class="cross"></span>
class="cross"></span>
</td>
<td style="text-align: center; vertical-align: middle;"><span class="label">5</span><span
class="cross"></span>
@@ -3902,4 +4075,10 @@ <h2>How many single-color paths go from A to D? Answer with a number in curly br

</body>

</html>

</html>

</html>

</html>
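
The summary figures quoted in the figcaption added by this commit (58.57% overall, 74.94% for Sonnet-3.5, 24% for random chance) can be cross-checked against the table cells in the diff. The sketch below is only a sanity check, not part of the page: the numbers are copied from the `rnumber` cells above, while the variable names and the `mean` helper are made up for illustration. It assumes each summary value is an unweighted mean over the eight table columns, which is not stated in the source.

```python
# Accuracy (%) per task column, copied from the results table in this diff.
# Column order follows the table header: line intersect, two circles,
# circled letter, Olympic rings, pentagon, nested squares, grid, path following.
scores = {
    "GPT-4o":     [41.61, 72.67, 70.18, 42.50, 17.50, 55.83, 39.58, 47.89],
    "Gemini-1.5": [66.94, 92.78, 92.81, 87.08, 19.37, 80.00, 39.39, 41.60],
    "Sonnet-3":   [43.41, 84.52, 73.34, 31.66, 9.79, 65.00, 36.17, 23.24],
    "Sonnet-3.5": [75.36, 91.66, 89.22, 44.16, 77.29, 92.08, 74.26, 55.53],
}
random_baseline = [33.33, 50.00, 5.77, 20.00, 20.00, 25.00, 4.55, 33.33]


def mean(values):
    """Unweighted arithmetic mean (assumed aggregation, see lead-in)."""
    return sum(values) / len(values)


# Per-model mean across the eight columns (the table's right-hand "Mean" column).
model_means = {name: mean(row) for name, row in scores.items()}

# Mean over the four per-model means (the table's bottom-right cell).
overall_mean = mean(list(model_means.values()))

# Mean of the per-column random-chance baselines.
random_mean = mean(random_baseline)

for name, value in model_means.items():
    print(f"{name}: {value:.3f}")
print(f"overall: {overall_mean:.3f}")
print(f"random: {random_mean:.3f}")
```

Under that assumption, the computed values agree with the table's Mean column and with the caption to within rounding.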
