From 04f0e1e37b305088e9bc0584cc6a5e8d74f4d56a Mon Sep 17 00:00:00 2001 From: VLMsAreBlind Date: Sun, 7 Jul 2024 23:53:17 -0600 Subject: [PATCH] Update --- index-preview.html | 752 +++++++++++++++++- .../GridExamples/blank_grid_3x3_500_10.svg | 7 + .../GridExamples/blank_grid_3x4_500_10.svg | 7 + .../GridExamples/blank_grid_4x3_500_10.svg | 7 + .../GridExamples/blank_grid_4x4_2000_20.svg | 7 + .../GridExamples/blank_grid_4x5_2000_20.svg | 7 + .../GridExamples/blank_grid_5x4_2000_20.svg | 7 + .../GridExamples/blank_grid_5x5_2000_20.svg | 7 + .../GridExamples/blank_grid_5x6_2000_20.svg | 7 + .../GridExamples/blank_grid_6x5_2000_20.svg | 7 + .../GridExamples/blank_grid_6x6_2000_20.svg | 7 + .../GridExamples/blank_grid_6x7_2000_20.svg | 7 + .../GridExamples/blank_grid_7x6_2000_20.svg | 7 + .../GridExamples/blank_grid_7x7_2000_20.svg | 7 + .../GridExamples/blank_grid_7x8_2000_10.svg | 7 + .../GridExamples/blank_grid_8x7_1250_20.svg | 7 + .../GridExamples/blank_grid_9x10_1250_20.svg | 7 + .../GridExamples/blank_grid_9x10_500_20.svg | 7 + .../GridExamples/blank_grid_9x9_1250_20.svg | 7 + .../GridExamples/text_grid_3x3_2000_10.svg | 7 + .../GridExamples/text_grid_3x4_2000_10.svg | 7 + .../GridExamples/text_grid_4x3_2000_10.svg | 7 + .../GridExamples/text_grid_4x4_2000_20.svg | 7 + .../GridExamples/text_grid_4x4_500_20.svg | 7 + .../GridExamples/text_grid_4x5_2000_20.svg | 7 + .../GridExamples/text_grid_4x5_500_20.svg | 7 + .../GridExamples/text_grid_5x4_2000_20.svg | 7 + .../GridExamples/text_grid_5x4_500_20.svg | 7 + .../GridExamples/text_grid_5x5_2000_20.svg | 7 + .../GridExamples/text_grid_5x6_2000_20.svg | 7 + .../GridExamples/text_grid_6x5_2000_20.svg | 7 + .../GridExamples/text_grid_6x6_2000_20.svg | 7 + .../GridExamples/text_grid_6x7_2000_10.svg | 7 + .../GridExamples/text_grid_6x7_2000_20.svg | 7 + .../GridExamples/text_grid_7x6_1250_10.svg | 7 + .../GridExamples/text_grid_7x6_2000_20.svg | 7 + .../GridExamples/text_grid_7x7_2000_10.svg | 7 + 
.../GridExamples/text_grid_7x7_2000_20.svg | 7 + .../GridExamples/text_grid_8x7_500_10.svg | 7 + .../GridExamples/text_grid_8x9_1250_20.svg | 7 + .../GridExamples/text_grid_8x9_2000_20.svg | 7 + .../GridExamples/text_grid_9x8_1250_20.svg | 7 + .../GridExamples/text_grid_9x9_2000_10.svg | 7 + .../GridExamples/text_grid_9x9_500_10.svg | 7 + ...1_9d3dd747-60f9-469f-a25b-66755fec568f.svg | 7 + ...3_07847062-0a5b-40be-ad4c-fb8974747dc8.svg | 7 + ...3_2983a529-8a38-49f7-b555-75539a32fc88.svg | 7 + ...1_feb230d1-ab0c-4600-b782-bd1a01191cbd.svg | 7 + ...2_b982336c-68f6-4fce-9f7b-81b789b7c60a.svg | 7 + ...3_79870125-c92a-4780-81b3-aab8f8d7c4c6.svg | 7 + ...2_97592c8d-7d89-41e5-ba2f-f03bacd4bd21.svg | 7 + ...2_ac853acf-1e75-422b-8074-1d18a8f3c757.svg | 7 + ...1_4bde93d6-5c5c-421c-9fe0-b3b7e155276f.svg | 7 + ...1_8aa774ab-55ad-43e5-810d-8ce78fdaa8fd.svg | 7 + ...1_e5198d80-d54f-4957-9ca9-c76a3a0bf654.svg | 7 + ...1_e851a144-b258-4905-a50f-eb148a72bb65.svg | 7 + ...2_8c29c646-fbd9-4f16-9178-b825b9aa4851.svg | 7 + ...2_931b8867-3028-48fc-a81f-f417f9261d9d.svg | 7 + ...2_e12d964a-cdf0-4168-bc94-3b7d0f7f051f.svg | 7 + ...3_136530b6-28e5-47e1-ac40-cd998431f233.svg | 7 + ...3_86f7d918-db10-450a-8600-600e0f3eb474.svg | 7 + ...3_941abb65-fa92-45ac-9ecc-d155646cbbe6.svg | 7 + ...1_73b2dfe1-2604-4c41-8b5a-131643cc1c0c.svg | 7 + ...2_9bd2d5ae-ffd2-4146-ae57-12c4781fb987.svg | 7 + ...3_ee2e5983-b36e-42b9-882b-aa86660a2ed0.svg | 7 + ...1_dace7968-cafb-409c-983f-d0051b98cffc.svg | 7 + ...2_47f692ff-9e7b-4460-bff7-bef1e7eeb57d.svg | 7 + ...3_cbcaadde-80af-43c7-84d6-a6d5489b70f8.svg | 7 + ...1_35a6279d-b820-4396-8881-14c16a750a11.svg | 7 + ...1_7e303a0b-310d-4272-bc95-6b46e6eef035.svg | 7 + ...2_96f9f10e-a561-41c8-8275-e0f8e98f9e01.svg | 7 + ...3_208c5704-5d9c-4afc-8fa3-e1f1a5acbb8e.svg | 7 + ...1_7d70e254-e6b5-497b-af2c-78047155e153.svg | 7 + ...2_3e537b58-0e16-4920-8cf6-217c0362977b.svg | 7 + ...3_48f7085a-7fc5-40bf-9dbe-b393b7f4c7a3.svg | 7 + ...2_369709ec-3332-4a19-96b4-a848bec5a679.svg 
| 7 + ...2_ff376d76-8313-422b-9968-fbb311c66f55.svg | 7 + ...1_46a3cf36-3ec2-4f83-b5a9-1121e82ad689.svg | 7 + ...1_caf3c976-4326-4abd-abc8-109140b8f16a.svg | 7 + ...3_67b675e5-1cb0-4b68-92fe-647677991ffa.svg | 7 + ...3_d042a607-3b77-4eec-87c0-9f8b8c2cc118.svg | 7 + ...3_c06eac36-a4d1-405f-b445-9b9790b0366c.svg | 7 + ...2_b982336c-68f6-4fce-9f7b-81b789b7c60a.svg | 7 + ...1_921cbd8a-9833-433a-9104-9b568bf10ac7.svg | 7 + 84 files changed, 1324 insertions(+), 9 deletions(-) create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_3x3_500_10.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_3x4_500_10.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_4x3_500_10.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_4x4_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_4x5_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_5x4_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_5x5_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_5x6_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_6x5_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_6x6_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_6x7_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_7x6_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_7x7_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_7x8_2000_10.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_8x7_1250_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_9x10_1250_20.svg create mode 100644 
static/images/CountGridRowColumns/GridExamples/blank_grid_9x10_500_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/blank_grid_9x9_1250_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_3x3_2000_10.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_3x4_2000_10.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_4x3_2000_10.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_4x4_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_4x4_500_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_4x5_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_4x5_500_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_5x4_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_5x4_500_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_5x5_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_5x6_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_6x5_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_6x6_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_6x7_2000_10.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_6x7_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_7x6_1250_10.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_7x6_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_7x7_2000_10.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_7x7_2000_20.svg create mode 100644 
static/images/CountGridRowColumns/GridExamples/text_grid_8x7_500_10.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_8x9_1250_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_8x9_2000_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_9x8_1250_20.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_9x9_2000_10.svg create mode 100644 static/images/CountGridRowColumns/GridExamples/text_grid_9x9_500_10.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_1024_linewidth_10_path_1_9d3dd747-60f9-469f-a25b-66755fec568f.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_1024_linewidth_10_path_3_07847062-0a5b-40be-ad4c-fb8974747dc8.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_1024_linewidth_10_path_3_2983a529-8a38-49f7-b555-75539a32fc88.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_1024_linewidth_20_path_1_feb230d1-ab0c-4600-b782-bd1a01191cbd.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_1024_linewidth_20_path_2_b982336c-68f6-4fce-9f7b-81b789b7c60a.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_1024_linewidth_20_path_3_79870125-c92a-4780-81b3-aab8f8d7c4c6.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_512_linewidth_10_path_2_97592c8d-7d89-41e5-ba2f-f03bacd4bd21.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_512_linewidth_10_path_2_ac853acf-1e75-422b-8074-1d18a8f3c757.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_512_linewidth_20_path_1_4bde93d6-5c5c-421c-9fe0-b3b7e155276f.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_512_linewidth_20_path_1_8aa774ab-55ad-43e5-810d-8ce78fdaa8fd.svg create mode 100644 
static/images/SubwayConnections/appendix_figs/pixels_512_linewidth_20_path_1_e5198d80-d54f-4957-9ca9-c76a3a0bf654.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_512_linewidth_20_path_1_e851a144-b258-4905-a50f-eb148a72bb65.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_512_linewidth_20_path_2_8c29c646-fbd9-4f16-9178-b825b9aa4851.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_512_linewidth_20_path_2_931b8867-3028-48fc-a81f-f417f9261d9d.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_512_linewidth_20_path_2_e12d964a-cdf0-4168-bc94-3b7d0f7f051f.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_512_linewidth_20_path_3_136530b6-28e5-47e1-ac40-cd998431f233.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_512_linewidth_20_path_3_86f7d918-db10-450a-8600-600e0f3eb474.svg create mode 100644 static/images/SubwayConnections/appendix_figs/pixels_512_linewidth_20_path_3_941abb65-fa92-45ac-9ecc-d155646cbbe6.svg create mode 100644 static/images/SubwayConnections/construction/pixels_1024_linewidth_10_path_1_73b2dfe1-2604-4c41-8b5a-131643cc1c0c.svg create mode 100644 static/images/SubwayConnections/construction/pixels_1024_linewidth_10_path_2_9bd2d5ae-ffd2-4146-ae57-12c4781fb987.svg create mode 100644 static/images/SubwayConnections/construction/pixels_1024_linewidth_10_path_3_ee2e5983-b36e-42b9-882b-aa86660a2ed0.svg create mode 100644 static/images/SubwayConnections/construction/pixels_1024_linewidth_20_path_1_dace7968-cafb-409c-983f-d0051b98cffc.svg create mode 100644 static/images/SubwayConnections/construction/pixels_1024_linewidth_20_path_2_47f692ff-9e7b-4460-bff7-bef1e7eeb57d.svg create mode 100644 static/images/SubwayConnections/construction/pixels_1024_linewidth_20_path_3_cbcaadde-80af-43c7-84d6-a6d5489b70f8.svg create mode 100644 
static/images/SubwayConnections/construction/pixels_512_linewidth_10_path_1_35a6279d-b820-4396-8881-14c16a750a11.svg create mode 100644 static/images/SubwayConnections/construction/pixels_512_linewidth_10_path_1_7e303a0b-310d-4272-bc95-6b46e6eef035.svg create mode 100644 static/images/SubwayConnections/construction/pixels_512_linewidth_10_path_2_96f9f10e-a561-41c8-8275-e0f8e98f9e01.svg create mode 100644 static/images/SubwayConnections/construction/pixels_512_linewidth_10_path_3_208c5704-5d9c-4afc-8fa3-e1f1a5acbb8e.svg create mode 100644 static/images/SubwayConnections/construction/pixels_512_linewidth_20_path_1_7d70e254-e6b5-497b-af2c-78047155e153.svg create mode 100644 static/images/SubwayConnections/construction/pixels_512_linewidth_20_path_2_3e537b58-0e16-4920-8cf6-217c0362977b.svg create mode 100644 static/images/SubwayConnections/construction/pixels_512_linewidth_20_path_3_48f7085a-7fc5-40bf-9dbe-b393b7f4c7a3.svg create mode 100644 static/images/SubwayConnections/dataset_examples/pixels_1024_linewidth_10_path_2_369709ec-3332-4a19-96b4-a848bec5a679.svg create mode 100644 static/images/SubwayConnections/dataset_examples/pixels_1024_linewidth_10_path_2_ff376d76-8313-422b-9968-fbb311c66f55.svg create mode 100644 static/images/SubwayConnections/dataset_examples/pixels_1024_linewidth_20_path_1_46a3cf36-3ec2-4f83-b5a9-1121e82ad689.svg create mode 100644 static/images/SubwayConnections/dataset_examples/pixels_1024_linewidth_20_path_1_caf3c976-4326-4abd-abc8-109140b8f16a.svg create mode 100644 static/images/SubwayConnections/dataset_examples/pixels_1024_linewidth_20_path_3_67b675e5-1cb0-4b68-92fe-647677991ffa.svg create mode 100644 static/images/SubwayConnections/dataset_examples/pixels_1024_linewidth_20_path_3_d042a607-3b77-4eec-87c0-9f8b8c2cc118.svg create mode 100644 static/images/SubwayConnections/pixels_1024_linewidth_10_path_3_c06eac36-a4d1-405f-b445-9b9790b0366c.svg create mode 100644 
static/images/SubwayConnections/pixels_1024_linewidth_20_path_2_b982336c-68f6-4fce-9f7b-81b789b7c60a.svg create mode 100644 static/images/SubwayConnections/pixels_512_linewidth_10_path_1_921cbd8a-9833-433a-9104-9b568bf10ac7.svg diff --git a/index-preview.html b/index-preview.html index 3db7685..3df4d11 100644 --- a/index-preview.html +++ b/index-preview.html @@ -995,6 +995,14 @@ .special-word.ackn { color: #2980b9; } + + .spacer { + height: 50px; + } + + .spacer-2 { + height: 100px; + } @@ -1271,11 +1279,10 @@

Results

-

Qualitative samples

- +

Qualitative samples

@@ -1790,7 +1797,7 @@

Prompts

Groundtruth

- Letters need to match predicted letters exactly (case-sensitive). + Letters need to match predicted letters exactly (case-insensitive).

@@ -1854,12 +1861,9 @@

Results

89.22 - -

Qualitative samples

-
- +

Qualitative samples

@@ -2148,10 +2152,10 @@

Results

-

Qualitative samples

+

Qualitative samples

@@ -2393,11 +2397,11 @@

Results

-

Qualitative samples

+

Qualitative samples

@@ -2520,6 +2524,736 @@

Count total number of squares in the image.

+ +
+ +
+
+ +

Task 6: Counting the rows and columns of a grid

+ + +
+

+ The results from prior tasks show that VLMs cannot always count shapes that are overlapping (Task 4) or nested + (Task 5). + What about adjacent shapes? Here, we tile shapes (specifically, squares) into a grid and challenge VLMs to + count them, a + task that is supposedly simple for VLMs given their remarkable performance (≥ 90% accuracy) on DocVQA, which + includes + many questions with tables. + To simplify the task, we ask models to count the number of rows and columns in a given table. +

+ +

Images

+

+ A grid may have N×N, N×N', or N'×N cells, where N∈{3, 4, 5, 6, 7, 8, 9} and N' = N + 1. + Each grid is rendered with two different line thicknesses on a canvas of size C×C, where C∈{500, 1250, 2000}px. + In addition to empty grids, we repeat the procedure to create grids that contain text (which is more common in + real-world + tables), where each cell contains a single random word. + The two versions combined comprise 2×222 = 444 images. +
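The grid description above can be illustrated with a minimal sketch. It is not the authors' generator: the function names `grid_svg` and `grid_shapes`, the half-stroke inset, and the SVG attributes are assumptions made for illustration. Only the shape rule (N×N, N×N', N'×N with N' = N + 1), the square canvas, and the configurable line thickness come from the text.

```python
from typing import Iterator, Tuple


def grid_shapes() -> Iterator[Tuple[int, int]]:
    """Yield the (rows, cols) shapes described in the text: N×N, N×N', N'×N."""
    for n in range(3, 10):      # N in {3, ..., 9}
        yield (n, n)            # N × N
        yield (n, n + 1)        # N × N'
        yield (n + 1, n)        # N' × N


def grid_svg(rows: int, cols: int, canvas: int = 500, line_width: int = 10) -> str:
    """Return an SVG string for a blank rows × cols grid on a square canvas.

    Hypothetical renderer: the inset below is an assumption that keeps the
    outer border from being clipped at the canvas edge.
    """
    pad = line_width / 2
    span = canvas - line_width
    lines = []
    for r in range(rows + 1):   # rows + 1 horizontal lines
        y = pad + span * r / rows
        lines.append(f'<line x1="{pad}" y1="{y}" x2="{canvas - pad}" y2="{y}"/>')
    for c in range(cols + 1):   # cols + 1 vertical lines
        x = pad + span * c / cols
        lines.append(f'<line x1="{x}" y1="{pad}" x2="{x}" y2="{canvas - pad}"/>')
    body = "\n  ".join(lines)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{canvas}" height="{canvas}" '
        f'stroke="black" stroke-width="{line_width}" stroke-linecap="square">\n'
        f"  {body}\n</svg>"
    )
```

For example, `grid_svg(3, 4, canvas=500, line_width=10)` produces a grid with 3 rows and 4 columns; counting rows and columns then amounts to counting the gaps between consecutive parallel lines.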

+
+
+
+ Text grid 3x3 +
Text grid (3x3)
+
+
+
+
+ Text grid 3x4 +
Text grid (3x4)
+
+
+
+
+ Empty grid 4x4 +
Empty grid (4x4)
+
+
+
+
+ Empty grid 4x5 +
Empty grid (4x5)
+
+
+
+
Fig. 9: Examples of grid images used in the task, showing text-filled and empty grids with various + dimensions. +
+
+ +

Prompts

+
+

+ We ask each question using two different wordings: +

+
    +
  1. "Count the number of rows and columns and answer with numbers in curly brackets. For example, rows={5} + columns={6}"
  2. "How many rows and columns are in the table? Answer with only the numbers in a pair (row, column), + e.g., + (5,6)"
+ +

Groundtruth

+
+

+ Answers include both the number of rows and columns. An answer is correct when both column and row counts + are + correctly predicted. +
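A minimal sketch of how such answers could be parsed and scored, assuming the model replies in one of the two formats requested by the prompts. The regular expressions and the names `parse_grid_answer` and `is_correct` are illustrative assumptions, not the authors' evaluation code.

```python
import re
from typing import Optional, Tuple

# Format requested by the first prompt, e.g. "rows={5} columns={6}".
CURLY = re.compile(r"rows\s*=\s*\{(\d+)\}.*?columns\s*=\s*\{(\d+)\}",
                   re.IGNORECASE | re.DOTALL)
# Format requested by the second prompt, e.g. "(5,6)".
PAIR = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_grid_answer(text: str) -> Optional[Tuple[int, int]]:
    """Extract (rows, columns) from a model reply, or None if no format matches."""
    for pattern in (CURLY, PAIR):
        match = pattern.search(text)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def is_correct(text: str, rows: int, cols: int) -> bool:
    """An answer counts as correct only when both counts match the ground truth."""
    return parse_grid_answer(text) == (rows, cols)
```

Under this scoring rule, a reply that gets the rows right but the columns wrong earns no partial credit.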

+
+ +

Results

+

+ The following table shows the performance of the four models on the task of counting rows and columns in + grids. +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Grid type | GPT-4o | Gemini-1.5 Pro | Sonnet-3 | Sonnet-3.5
Blank | 26.13 | 25.75 | 25.00 | 59.84
Text | 53.03 | 45.83 | 47.34 | 88.68
Average | 39.58 | 35.79 | 36.17 | 74.26
+ +
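As a quick arithmetic check, the Average row in the table above matches the unweighted mean of the Blank and Text rows for every model. The sketch below recomputes it; the dictionary layout and the name `mean_accuracy` are assumptions, and the numbers are copied from the table.

```python
# (blank, text) scores per model, copied from the table above.
scores = {
    "GPT-4o": (26.13, 53.03),
    "Gemini-1.5 Pro": (25.75, 45.83),
    "Sonnet-3": (25.00, 47.34),
    "Sonnet-3.5": (59.84, 88.68),
}

# Reported "Average" row, copied from the table above.
reported = {
    "GPT-4o": 39.58,
    "Gemini-1.5 Pro": 35.79,
    "Sonnet-3": 36.17,
    "Sonnet-3.5": 74.26,
}


def mean_accuracy(blank: float, text: float) -> float:
    """Unweighted mean of the blank-grid and text-grid scores."""
    return (blank + text) / 2


for model, (blank, text) in scores.items():
    # Allow for two-decimal rounding in the reported values.
    assert abs(mean_accuracy(blank, text) - reported[model]) < 0.01, model
```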
+ +
+

Qualitative samples

+
+
+
+

Count the number of rows and columns and answer with numbers in curly brackets. For example, rows={5} + columns={6} +

+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Grid 1 + Grid 2 + Grid 3 + Grid 4 + Grid 5 + Grid 6 +
4×4 + 6×6 + 7×7 + 6×6 + 6×6 + 6×6 +
5×5 + 6×6 + 7×7 + 10×10 + 5×6 + 10×10 +
5×5 + 7×8 + 6×6 + 9×9 + 6×6 + 9×12 +
4×5 + 6×7 + 7×7 + 8×7 + 5×6 + 8×8 +
+
+
+ GPT-4o GPT-4o +
+
+ Gemini-1.5 Gemini-1.5 + Pro +
+
+ Sonnet-3 Sonnet-3 +
+
+ Sonnet-3.5 Sonnet-3.5
+
+
+
Fig. 12: Examples from the benchmark show that models consistently fail at counting rows and + columns of + blank grids.
+
+
+ +
+ +
+
+
+
+

How many rows and columns are in the table? Answer with only the numbers in a pair (row, column), + e.g., (5,6).

+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Grid 1 + Grid 2 + Grid 3 + Grid 4 + Grid 5 + Grid 6 +
4×4 + 4×5 + 5×4 + 5×6 + 6×8 + 7×8 +
4×4 + 4×5 + 5×4 + 5×6 + 6×8 + 7×8 +
4×4 + 5×5 + 5×4 + 6×6 + 7×7 + 8×7 +
4×4 + 4×5 + 5×4 + 5×6 + 6×7 + 7×7 +
+
+
+ GPT-4o GPT-4o +
+
+ Gemini-1.5 Gemini-1.5 + Pro +
+
+ Sonnet-3 Sonnet-3 +
+
+ Sonnet-3.5 Sonnet-3.5
+
+
+
Fig. 13: When text is included in the cells of the grid, the performance of all VLMs improves, + especially + Sonnet-3.5. +
+
+ +
+
+ + + +
+ +
+
+ +

Task 7: Following single-colored paths

+ + + +
+

+ VLMs need to follow paths in order to read maps or charts, interpret graphs, and + understand + user notations (e.g., arrows) in input images. To assess path-following capability, this task asks models to + count the + single-colored paths between two given stations in a simplified subway map. This is another task that is easy for humans + but + challenges VLMs significantly. +

+ +

Images

+

+ We create each subway map on an image of size C×C, where C ∈ {512, 1024}px. We write 4 station names (A, B, C, + D) at 4 + fixed coordinates. We divide the canvas into an invisible grid of 18×18 cells and initialize 3 path-starting + points + C/18px away from each station. We draw each path with depth-first search, starting from a random + station + and a random starting point, where a valid move is one cell in any direction: north, south, east, or west. We + repeat + the process so that each station has exactly N ∈ {1, 2, 3} outgoing paths, for a total of 180 maps. +
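The path-drawing step can be sketched as a randomized depth-first search on the 18×18 cell grid. This is a sketch under stated assumptions, not the authors' generator: the `goal` cell, the `blocked` set (cells already occupied by earlier paths), and the function name `dfs_path` are hypothetical. Only the grid size and the four-direction move rule come from the text.

```python
import random
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]
MOVES = [(0, -1), (0, 1), (1, 0), (-1, 0)]  # north, south, east, west


def dfs_path(start: Cell, goal: Cell, blocked: Set[Cell], size: int = 18,
             rng: Optional[random.Random] = None) -> Optional[List[Cell]]:
    """Return a simple path of grid cells from start to goal, or None if none exists."""
    rng = rng or random.Random()
    visited = {start}
    path = [start]

    def visit(cell: Cell) -> bool:
        if cell == goal:
            return True
        moves = MOVES[:]
        rng.shuffle(moves)                 # random move order gives varied paths
        for dx, dy in moves:
            nxt = (cell[0] + dx, cell[1] + dy)
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue                   # outside the invisible grid
            if nxt in visited or nxt in blocked:
                continue                   # already explored or occupied
            visited.add(nxt)
            path.append(nxt)
            if visit(nxt):
                return True
            path.pop()                     # dead end: backtrack
        return False

    return path if visit(start) else None
```

Each returned cell would then be scaled by C/18 to obtain pixel coordinates on the C×C canvas. At most 18×18 = 324 cells are visited, so the recursion stays well within Python's default recursion limit.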

+ +
+
+
+ Station with 1 path +
Station with 1 path, and linewidth 10px
+
+
+
+
+ Station with 2 paths +
Station with 2 paths, and linewidth 20px
+
+
+
+
+ Station with 2 paths +
Station with 2 paths, and linewidth 20px
+
+
+
+
+ Station with 3 paths +
Station with 3 paths, and linewidth 10px
+
+
+
+
Fig. 14: Examples of subway map images used in the task, showing different numbers of paths and + variations + in path thickness.
+
+ +

Prompts

+
+

+ We ask each question using two different wordings: +

+
    +
  1. "How many single-colored paths go from A to C? Answer with a number in curly brackets, e.g., {3}" +
  2. "Count the one-colored routes that go from A to C. Answer with a number in curly brackets, e.g., + {3}." +
+ +

Groundtruth

+
+

+ Each answer is a number in {0, 1, 2, 3} (random-baseline accuracy: 25%). +
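A minimal sketch of the answer format and the random baseline, assuming the model replies with a number in curly brackets as the prompts request. The names `parse_count` and `random_baseline` are illustrative, not part of the authors' code.

```python
import re
from typing import Optional, Sequence

CHOICES = (0, 1, 2, 3)  # possible ground-truth path counts


def parse_count(text: str) -> Optional[int]:
    """Extract the first number written in curly brackets, e.g. "{3}" -> 3."""
    match = re.search(r"\{(\d+)\}", text)
    return int(match.group(1)) if match else None


def random_baseline(choices: Sequence[int] = CHOICES) -> float:
    """Expected accuracy of guessing uniformly among the possible answers."""
    return 1 / len(choices)
```

With four possible answers, uniform guessing is right one time in four, which gives the 25% baseline quoted above.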

+
+ +

Results

+

+ The following table shows the performance of the four models on the task of counting single-colored paths + between + stations. +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Paths | GPT-4o | Gemini-1.5 Pro | Sonnet-3 | Sonnet-3.5
1 | 67.50 | 85.41 | 23.75 | 95.00
2 | 44.37 | 28.75 | 37.18 | 56.25
3 | 36.71 | 25.78 | 15.42 | 25.39
Average | 45.89 | 40.01 | 23.78 | 50.18
+
+
+

Qualitative samples

+
+
+
+

How many single-color paths go from A to D? Answer with a number in curly brackets e.g. {3}

+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Subway Map 1Subway Map 2Subway Map 3Subway Map 4Subway Map 5Subway Map 6
1 + 1 + 2 + 3 + 2 + 1 +
2 + 2 + 4 + 1 + 1 + 4 +
2 + 1 + 3 + 2 + 4 + 4 +
1 + 1 + 3 + 3 + 2 + 3 +
+
+
+ GPT-4o GPT-4o +
+
+ Gemini-1.5 Gemini-1.5 + Pro +
+
+ Sonnet-3 Sonnet-3 +
+
+ Sonnet-3.5 Sonnet-3.5
+
+
+
Fig. 15: Some VLMs (Gemini-1.5, + Sonnet-3) + surprisingly fail in even extremely easy cases (leftmost). + As the + number of paths exiting each station increases, VLMs tend to perform worse. +
+ +
+
+
+ + + + +