Add array comparison utility to nanoarrow_testing #577

paleolimbot · 2024-08-05T18:48:53Z

Several kinds of array comparison have popped up in different places, depending on the exact needs of the test:

@zeroshade's utility in cudf (I believe this is something like Arrow C++'s "Equal" in that non-equal buffer values in null slots are still considered equal):

https://github.com/rapidsai/cudf/blob/af57286536fc21b47b80e45be222773b751600c9/cpp/tests/interop/to_arrow_host_test.cpp#L45-L210
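To make the distinction concrete, here is a minimal sketch of the two behaviours (exact match versus ignoring values in null slots). It does not use nanoarrow or cudf types: `SimpleInt32Array`, `IdenticalArrays()`, and `EqualIgnoringNullSlots()` are hypothetical names invented for illustration, and the validity bitmap is modelled as a `std::vector<bool>` rather than a packed buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for an int32 array (not a nanoarrow type).
struct SimpleInt32Array {
  std::vector<int32_t> values;  // one value per slot, including null slots
  std::vector<bool> valid;      // valid[i] == false means slot i is null
};

// "Identical" comparison: every value must match, even in null slots.
inline bool IdenticalArrays(const SimpleInt32Array& a, const SimpleInt32Array& b) {
  return a.values == b.values && a.valid == b.valid;
}

// "Relaxed" comparison: validity must match, but values are only compared
// in slots that are not null.
inline bool EqualIgnoringNullSlots(const SimpleInt32Array& a,
                                   const SimpleInt32Array& b) {
  assert(a.values.size() == a.valid.size());
  assert(b.values.size() == b.valid.size());
  if (a.valid != b.valid) return false;
  for (std::size_t i = 0; i < a.values.size(); ++i) {
    if (a.valid[i] && a.values[i] != b.values[i]) return false;
  }
  return true;
}
```

Two arrays that differ only in the garbage stored behind a null slot pass the relaxed check and fail the identical one, which is why an IPC roundtrip test and an interop test may reasonably want different comparers.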

@bkietz's Matcher to support IPC batch roundtrip testing:

struct ArrowArrayViewEqualTo {
  const struct ArrowArrayView* expected;

  using is_gtest_matcher = void;

  bool MatchAndExplain(const struct ArrowArrayView* actual, std::ostream* os) const {
    return MatchAndExplain({}, actual, expected, os);
  }

  static bool MatchAndExplain(std::vector<int> field_path,
                              const struct ArrowArrayView* actual,
                              const struct ArrowArrayView* expected, std::ostream* os) {
    auto prefixed = [&]() -> std::ostream& {
      if (!field_path.empty()) {
        for (int i : field_path) {
          *os << "." << i;
        }
        *os << ":";
      }
      return *os;
    };

    NANOARROW_DCHECK(actual->offset == 0);
    NANOARROW_DCHECK(expected->offset == 0);

    if (actual->length != expected->length) {
      prefixed() << "expected length=" << expected->length << "\n";
      prefixed() << "  actual length=" << actual->length << "\n";
      return false;
    }

    auto null_count = [](const struct ArrowArrayView* a) {
      return a->null_count != -1 ? a->null_count : ArrowArrayViewComputeNullCount(a);
    };
    if (null_count(actual) != null_count(expected)) {
      prefixed() << "expected null_count=" << null_count(expected) << "\n";
      prefixed() << "  actual null_count=" << null_count(actual) << "\n";
      return false;
    }

    // Check the bound first so buffer_type[] is never read out of range
    for (int64_t i = 0; i < NANOARROW_MAX_FIXED_BUFFERS &&
                        actual->layout.buffer_type[i] != NANOARROW_BUFFER_TYPE_NONE;
         ++i) {
      auto a_buf = actual->buffer_views[i];
      auto e_buf = expected->buffer_views[i];
      if (a_buf.size_bytes != e_buf.size_bytes) {
        prefixed() << "expected buffer[" << i << "].size=" << e_buf.size_bytes << "\n";
        prefixed() << "  actual buffer[" << i << "].size=" << a_buf.size_bytes << "\n";
        return false;
      }
      if (memcmp(a_buf.data.data, e_buf.data.data, a_buf.size_bytes) != 0) {
        prefixed() << "expected buffer[" << i << "]'s data to match\n";
        return false;
      }
    }

    field_path.push_back(0);
    for (int64_t i = 0; i < actual->n_children; ++i) {
      field_path.back() = i;
      if (!MatchAndExplain(field_path, actual->children[i], expected->children[i], os)) {
        return false;
      }
    }
    return true;
  }

  void DescribeTo(std::ostream* os) const { *os << "is equivalent to the array view"; }
  void DescribeNegationTo(std::ostream* os) const {
    *os << "is not equivalent to the array view";
  }
};

A utility comparer with probably terrible failure modes I hacked together (also to support IPC batch roundtrip testing):

void AssertArrayViewIdentical(const struct ArrowArrayView* actual,
                              const struct ArrowArrayView* expected) {
  // Dictionaries are not compared below, so require that there are none
  NANOARROW_DCHECK(actual->dictionary == nullptr);
  NANOARROW_DCHECK(expected->dictionary == nullptr);

  ASSERT_EQ(actual->storage_type, expected->storage_type);
  ASSERT_EQ(actual->offset, expected->offset);
  ASSERT_EQ(actual->length, expected->length);
  for (int i = 0; i < 3; i++) {
    auto a_buf = actual->buffer_views[i];
    auto e_buf = expected->buffer_views[i];
    ASSERT_EQ(a_buf.size_bytes, e_buf.size_bytes);
    if (a_buf.size_bytes != 0) {
      ASSERT_EQ(memcmp(a_buf.data.data, e_buf.data.data, a_buf.size_bytes), 0);
    }
  }

  ASSERT_EQ(actual->n_children, expected->n_children);
  for (int i = 0; i < actual->n_children; i++) {
    AssertArrayViewIdentical(actual->children[i], expected->children[i]);
  }
}

The implementation used for the integration tests (it is based on JSON, so it is probably not suitable for arbitrary input):

arrow-nanoarrow/src/nanoarrow/integration/c_data_integration.cc

Lines 176 to 198 in f74d57c

static ArrowErrorCode ImportBatchAndCompareToJson(const char* json_path, int num_batch,
                                                  ArrowArray* batch, ArrowError* error) {
  nanoarrow::UniqueArray actual(batch);
  MaterializedArrayStream data;
  NANOARROW_RETURN_NOT_OK(MaterializeJsonFilePath(json_path, &data, num_batch, error));
  nanoarrow::testing::TestingJSONComparison comparison;
  SetComparisonOptions(&comparison);
  NANOARROW_RETURN_NOT_OK(comparison.SetSchema(data.schema.get(), error));
  NANOARROW_RETURN_NOT_OK(
      comparison.CompareBatch(actual.get(), data.arrays[0].get(), error));
  if (comparison.num_differences() > 0) {
    std::stringstream ss;
    comparison.WriteDifferences(ss);
    ArrowErrorSet(error, "Found %d differences:\n%s",
                  static_cast<int>(comparison.num_differences()), ss.str().c_str());
    return EINVAL;
  }
  return NANOARROW_OK;
}


This PR is one possible component to address #577. In some cases we want a more relaxed comparison that allows (for example) arrays with the same content to be considered equal even if they have different content in null slots; in other cases we really do want an exact match. This PR adds `ArrowArrayViewCompare()` with a signature that could also be used to apply the equality check at a more relaxed level when that is implemented in a future PR, but it only implements the "identical" level, since this is the easiest and most pressing (it applies to IPC validation). The messages given by the implementation report the location of the difference but not what the difference actually was. Knowing where the difference is should usually be sufficient for a higher-level runtime (e.g., R, Python, C++) to give a fancier message if it wants or needs to.
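The design described above (a level parameter, with messages that locate the difference without describing it) can be sketched roughly as follows. This is not the nanoarrow implementation of `ArrowArrayViewCompare()`: `CompareLevel`, `SimpleArray`, and `CompareArrays()` are hypothetical names used only for illustration, and each array is reduced to a length, one raw buffer, and its children.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical comparison levels; only kIdentical is implemented in this sketch.
enum class CompareLevel { kIdentical };

// Hypothetical simplified nested array: a length, one raw buffer, and children.
struct SimpleArray {
  int64_t length = 0;
  std::vector<uint8_t> buffer;
  std::vector<SimpleArray> children;
};

// Returns true if the arrays match at the requested level. On a mismatch,
// *where names the location of the first difference (e.g.
// "root.children[0].buffer") but not the differing values.
inline bool CompareArrays(const SimpleArray& actual, const SimpleArray& expected,
                          CompareLevel level, std::string* where,
                          const std::string& path = "root") {
  assert(where != nullptr);
  if (actual.length != expected.length) {
    *where = path + ".length";
    return false;
  }
  if (actual.buffer != expected.buffer) {
    *where = path + ".buffer";
    return false;
  }
  if (actual.children.size() != expected.children.size()) {
    *where = path + ".n_children";
    return false;
  }
  for (std::size_t i = 0; i < actual.children.size(); ++i) {
    const std::string child_path = path + ".children[" + std::to_string(i) + "]";
    if (!CompareArrays(actual.children[i], expected.children[i], level, where,
                       child_path)) {
      return false;
    }
  }
  where->clear();  // no difference found at or below this node
  return true;
}
```

A caller that needs a richer message can use the reported path to look up the two values itself, which keeps the comparer small and leaves formatting to the higher-level runtime.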

vyasr · 2024-08-16T19:27:40Z

When you've got this implemented, if you get a chance we'd appreciate an issue on the cudf repo indicating that we should update our code! Thank you!

paleolimbot · 2024-08-16T19:38:57Z

Will do! (We've only done the easy-but-less-useful-to-you half right now, where we just check for perfectly identical buffers).

This was referenced Aug 5, 2024

Host implementation of to_arrow using nanoarrow rapidsai/cudf#16297

Merged

feat: Add ArrowArrayViewCompare() to check for array equality #578

Merged
