Add array comparison utility to nanoarrow_testing #577

Open
paleolimbot opened this issue Aug 5, 2024 · 2 comments

Comments

@paleolimbot
Member

Several different array comparison implementations have popped up in various places, each shaped by the exact needs of its tests:

@zeroshade's utility in cudf (I believe this is something like Arrow C++'s "Equal" in that non-equal buffer values in null slots are still considered equal)

https://github.com/rapidsai/cudf/blob/af57286536fc21b47b80e45be222773b751600c9/cpp/tests/interop/to_arrow_host_test.cpp#L45-L210

@bkietz's Matcher to support IPC batch roundtrip testing:

struct ArrowArrayViewEqualTo {
  const struct ArrowArrayView* expected;

  using is_gtest_matcher = void;

  bool MatchAndExplain(const struct ArrowArrayView* actual, std::ostream* os) const {
    return MatchAndExplain({}, actual, expected, os);
  }

  static bool MatchAndExplain(std::vector<int> field_path,
                              const struct ArrowArrayView* actual,
                              const struct ArrowArrayView* expected, std::ostream* os) {
    auto prefixed = [&]() -> std::ostream& {
      if (!field_path.empty()) {
        for (int i : field_path) {
          *os << "." << i;
        }
        *os << ":";
      }
      return *os;
    };

    NANOARROW_DCHECK(actual->offset == 0);
    NANOARROW_DCHECK(expected->offset == 0);

    if (actual->length != expected->length) {
      prefixed() << "expected length=" << expected->length << "\n";
      prefixed() << "  actual length=" << actual->length << "\n";
      return false;
    }

    auto null_count = [](const struct ArrowArrayView* a) {
      return a->null_count != -1 ? a->null_count : ArrowArrayViewComputeNullCount(a);
    };
    if (null_count(actual) != null_count(expected)) {
      prefixed() << "expected null_count=" << null_count(expected) << "\n";
      prefixed() << "  actual null_count=" << null_count(actual) << "\n";
      return false;
    }

    // Check the index bound first so buffer_type[] is never read out of range
    for (int64_t i = 0; i < NANOARROW_MAX_FIXED_BUFFERS &&
                        actual->layout.buffer_type[i] != NANOARROW_BUFFER_TYPE_NONE;
         ++i) {
      auto a_buf = actual->buffer_views[i];
      auto e_buf = expected->buffer_views[i];
      if (a_buf.size_bytes != e_buf.size_bytes) {
        prefixed() << "expected buffer[" << i << "].size=" << e_buf.size_bytes << "\n";
        prefixed() << "  actual buffer[" << i << "].size=" << a_buf.size_bytes << "\n";
        return false;
      }
      // Skip memcmp() for empty buffers, whose data pointers may be null
      if (a_buf.size_bytes != 0 &&
          memcmp(a_buf.data.data, e_buf.data.data, a_buf.size_bytes) != 0) {
        prefixed() << "expected buffer[" << i << "]'s data to match\n";
        return false;
      }
    }

    field_path.push_back(0);
    for (int64_t i = 0; i < actual->n_children; ++i) {
      field_path.back() = i;
      if (!MatchAndExplain(field_path, actual->children[i], expected->children[i], os)) {
        return false;
      }
    }
    return true;
  }

  void DescribeTo(std::ostream* os) const { *os << "is equivalent to the array view"; }
  void DescribeNegationTo(std::ostream* os) const {
    *os << "is not equivalent to the array view";
  }
};

A utility comparer that I hacked together, probably with terrible failure modes (also to support IPC batch roundtrip testing):

void AssertArrayViewIdentical(const struct ArrowArrayView* actual,
                              const struct ArrowArrayView* expected) {
  // Dictionaries are not compared below, so neither view may carry one
  NANOARROW_DCHECK(actual->dictionary == nullptr);
  NANOARROW_DCHECK(expected->dictionary == nullptr);

  ASSERT_EQ(actual->storage_type, expected->storage_type);
  ASSERT_EQ(actual->offset, expected->offset);
  ASSERT_EQ(actual->length, expected->length);
  for (int i = 0; i < 3; i++) {
    auto a_buf = actual->buffer_views[i];
    auto e_buf = expected->buffer_views[i];
    ASSERT_EQ(a_buf.size_bytes, e_buf.size_bytes);
    if (a_buf.size_bytes != 0) {
      ASSERT_EQ(memcmp(a_buf.data.data, e_buf.data.data, a_buf.size_bytes), 0);
    }
  }

  ASSERT_EQ(actual->n_children, expected->n_children);
  for (int i = 0; i < actual->n_children; i++) {
    AssertArrayViewIdentical(actual->children[i], expected->children[i]);
  }
}

The implementation used for the integration tests (it is based on JSON, so it is probably not suitable for arbitrary input):

static ArrowErrorCode ImportBatchAndCompareToJson(const char* json_path, int num_batch,
                                                  ArrowArray* batch, ArrowError* error) {
  nanoarrow::UniqueArray actual(batch);

  MaterializedArrayStream data;
  NANOARROW_RETURN_NOT_OK(MaterializeJsonFilePath(json_path, &data, num_batch, error));

  nanoarrow::testing::TestingJSONComparison comparison;
  SetComparisonOptions(&comparison);
  NANOARROW_RETURN_NOT_OK(comparison.SetSchema(data.schema.get(), error));
  NANOARROW_RETURN_NOT_OK(
      comparison.CompareBatch(actual.get(), data.arrays[0].get(), error));

  if (comparison.num_differences() > 0) {
    std::stringstream ss;
    comparison.WriteDifferences(ss);
    ArrowErrorSet(error, "Found %d differences:\n%s",
                  static_cast<int>(comparison.num_differences()), ss.str().c_str());
    return EINVAL;
  }

  return NANOARROW_OK;
}
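
To make the difference between these flavours concrete, here is a minimal sketch of an "identical" check next to an Arrow C++ "Equal"-style check that ignores values in null slots. It is not taken from any of the implementations above: `ToyInt32Array`, `ToyIdentical()`, and `ToyEqualIgnoringNullSlots()` are hypothetical names, and the sketch assumes a zero offset and a validity bitmap that always covers every element.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a primitive int32 array: a validity bitmap
// (one bit per element, least-significant bit first) plus a values buffer.
struct ToyInt32Array {
  std::vector<uint8_t> validity;
  std::vector<int32_t> values;
};

// Returns true if element i is non-null.
inline bool ToyIsValid(const ToyInt32Array& a, size_t i) {
  assert(i / 8 < a.validity.size());
  return ((a.validity[i / 8] >> (i % 8)) & 1) != 0;
}

// "Identical": every buffer must match byte for byte, null slots included.
inline bool ToyIdentical(const ToyInt32Array& a, const ToyInt32Array& b) {
  return a.validity == b.validity && a.values == b.values;
}

// Relaxed: validity must agree element by element, but values are only
// compared where the element is non-null.
inline bool ToyEqualIgnoringNullSlots(const ToyInt32Array& a, const ToyInt32Array& b) {
  if (a.values.size() != b.values.size()) return false;
  for (size_t i = 0; i < a.values.size(); ++i) {
    const bool a_valid = ToyIsValid(a, i);
    if (a_valid != ToyIsValid(b, i)) return false;
    if (a_valid && a.values[i] != b.values[i]) return false;
  }
  return true;
}
```

With validity `0x05` (element 1 is null), the arrays `{1, 99, 3}` and `{1, -7, 3}` fail the identical check but pass the relaxed one, which is the behaviour described for the cudf utility above.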

paleolimbot added a commit that referenced this issue Aug 10, 2024
This PR is one possible component to address #577. While in some cases
we want a more relaxed comparison that allows (for example) arrays with
the same content to be considered equal even if they have different
content in null slots, in some cases we really do want an exact match.
This PR adds `ArrowArrayViewCompare()` in such a way that the same
signature could be used to apply the equality check at a more relaxed
validation level when this is implemented in a future PR, but only
implements the "identical" level since this is the easiest/most pressing
(applies to IPC validation).

The messages given by the implementation give the location of the
difference but not what the difference actually was. Knowing where the
error was is usually sufficient for a higher level runtime (e.g., R,
Python, C++) to give a fancier message if they want or need to.
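
As a rough illustration of the design described in this commit message (one signature, a comparison level argument, and a message that reports only where the first difference is), here is a self-contained sketch. It does not use nanoarrow: `ToyArrayView`, `ToyCompareLevel`, and `ToyCompare()` are hypothetical stand-ins, and the path format in the message is an assumption rather than what `ArrowArrayViewCompare()` actually emits.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an array view: one data buffer plus children.
struct ToyArrayView {
  int64_t length;
  std::vector<uint8_t> buffer;
  std::vector<ToyArrayView> children;
};

// Only kIdentical is implemented; kEqual is a placeholder for a relaxed level.
enum class ToyCompareLevel { kIdentical, kEqual };

// Recursive helper: records the location of the first difference in *reason.
static bool ToyIdenticalImpl(const ToyArrayView& actual, const ToyArrayView& expected,
                             const std::string& path, std::string* reason) {
  if (actual.length != expected.length) {
    *reason = path + ".length";
    return false;
  }
  if (actual.buffer != expected.buffer) {
    *reason = path + ".buffer";
    return false;
  }
  if (actual.children.size() != expected.children.size()) {
    *reason = path + ".n_children";
    return false;
  }
  for (size_t i = 0; i < actual.children.size(); ++i) {
    const std::string child_path = path + ".children[" + std::to_string(i) + "]";
    if (!ToyIdenticalImpl(actual.children[i], expected.children[i], child_path, reason)) {
      return false;
    }
  }
  return true;
}

// Returns 0 and sets *out on success, or ENOTSUP for an unimplemented level.
// The return code reports whether the comparison ran, not whether it matched.
int ToyCompare(const ToyArrayView& actual, const ToyArrayView& expected,
               ToyCompareLevel level, bool* out, std::string* reason) {
  assert(out != nullptr && reason != nullptr);
  if (level != ToyCompareLevel::kIdentical) {
    *reason = "comparison level not implemented";
    return ENOTSUP;
  }
  reason->clear();
  *out = ToyIdenticalImpl(actual, expected, "root", reason);
  return 0;
}
```

Keeping the match result in `out` separate from the return code means a relaxed level could be added later without changing any caller's error handling, and a higher-level runtime can use the reported path to build its own message.
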
@vyasr
Contributor

vyasr commented Aug 16, 2024

When you've got this implemented, if you get a chance we'd appreciate an issue on the cudf repo indicating that we should update our code! Thank you!

@paleolimbot
Member Author

Will do! (We've only done the easy-but-less-useful-to-you half right now, where we just check for perfectly identical buffers).
