Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2113 using compact incomplete on a library with dynamic schema with a named index can result in an unreadable index #2116

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

G-D-Petrov
Copy link
Collaborator

Reference Issues/PRs

Fixes #2113

What does this implement or fix?

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

auto new_df_field_index_count = new_df_descriptor.index().type() == IndexDescriptor::Type::EMPTY ? 0 : new_df_descriptor.index().field_count();

// If either index is empty, we consider them to match
if (df_in_store_index_field_count == 0 || new_df_field_index_count == 0) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is to accommodate the existing behavior around empty DFs and Series, both of which have essentially empty indexes, even though for series the types is RowCount, I think

@G-D-Petrov
Copy link
Collaborator Author

G-D-Petrov commented Jan 14, 2025

The benchmarks are reporting ~30% performance degradation on the FinalizeStagedData benchmarks:

Change Before [d71a0bb] <v5.2.0rc0~1> After [c1b389e] Ratio Benchmark (Parameter)
+ 904M 1.45G 1.6 finalize_staged_data.FinalizeStagedData.peakmem_finalize_staged_data(1000)
+ 1.88G 2.73G 1.45 finalize_staged_data.FinalizeStagedData.peakmem_finalize_staged_data(2000)
+ 1.71±0s 2.33±0s 1.36 finalize_staged_data.FinalizeStagedData.time_finalize_staged_data(1000)
+ 3.49±0s 4.66±0s 1.34 finalize_staged_data.FinalizeStagedData.time_finalize_staged_data(2000)

I think that this is due to the new check over all of the segments to make sure that the index names are the same, which was not done before.

IGNORE THIS: The latest commit fixes this - 300ae92

@@ -86,6 +86,31 @@ inline void check_normalization_index_match(
}
}

inline bool index_names_match(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add schema_checks.cpp and add this there?

@vasil-pashov
Copy link
Collaborator

vasil-pashov commented Jan 15, 2025

I think it's worth adding a test for sort_and_finalize_staged_data similar to the one for finalize_staged_data because the codepaths are slightly different

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using compact_incomplete on a library with dynamic schema with a named index can result in an unreadable index
2 participants