Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2113 using compact incomplete on a library with dynamic schema with a named index can result in an unreadable index #2116

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

G-D-Petrov
Copy link
Collaborator

Reference Issues/PRs

Fixes #2113

What does this implement or fix?

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

auto new_df_field_index_count = new_df_descriptor.index().type() == IndexDescriptor::Type::EMPTY ? 0 : new_df_descriptor.index().field_count();

// If either index is empty, we consider them to match
if (df_in_store_index_field_count == 0 || new_df_field_index_count == 0) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is to accommodate the existing behavior around empty DFs and Series, both of which have essentially empty indexes, even though for series the types is RowCount, I think

@G-D-Petrov
Copy link
Collaborator Author

G-D-Petrov commented Jan 14, 2025

The benchmarks are reporting ~30% performance degradation on the FinalizeStagedData benchmarks:

Change Before [d71a0bb] <v5.2.0rc0~1> After [c1b389e] Ratio Benchmark (Parameter)
+ 904M 1.45G 1.6 finalize_staged_data.FinalizeStagedData.peakmem_finalize_staged_data(1000)
+ 1.88G 2.73G 1.45 finalize_staged_data.FinalizeStagedData.peakmem_finalize_staged_data(2000)
+ 1.71±0s 2.33±0s 1.36 finalize_staged_data.FinalizeStagedData.time_finalize_staged_data(1000)
+ 3.49±0s 4.66±0s 1.34 finalize_staged_data.FinalizeStagedData.time_finalize_staged_data(2000)

I think that this is due to the new check over all of the segments to make sure that the index names are the same, which was not done before.

IGNORE THIS: The latest commit fixes this - 300ae92

@vasil-pashov
Copy link
Collaborator

vasil-pashov commented Jan 15, 2025

I think it's worth adding a test for sort_and_finalize_staged_data similar to the one for finalize_staged_data because the codepaths are slightly different

@G-D-Petrov G-D-Petrov force-pushed the 2113-using-compact_incomplete-on-a-library-with-dynamic-schema-with-a-named-index-can-result-in-an-unreadable-index branch from 300ae92 to 43d141c Compare January 15, 2025 14:33
@@ -650,7 +650,7 @@ size_t SegmentInMemoryImpl::num_bytes() const {
void SegmentInMemoryImpl::sort(const std::string& column_name) {
init_column_map();
auto idx = column_index(std::string_view(column_name));
user_input::check<ErrorCode::E_COLUMN_NOT_FOUND>(static_cast<bool>(idx), "Column {} not found in sort", column_name);
schema::check<ErrorCode::E_COLUMN_DOESNT_EXIST>(static_cast<bool>(idx), "Column {} not found in sort", column_name);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I've changed this so it is more consistent with the other similar exceptions

df_0.index.name = "date"
df_1 = pd.DataFrame({"col_0": [1]}, index=pd.date_range("2024-01-02", periods=1))
lib.write(sym, df_0)
lib.append(sym, df_1, incomplete=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think append(...incomplete=True) is the same as write(...incomplete=True). Let's keep the test. It's questionable design decision that we allow it.



@pytest.mark.parametrize("delete_staged_data_on_failure", [True, False])
def test_sort_and_finalize_staged_data_write_dynamic_schema_named_index(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can't this be parametrized by mode=[StagedDataFinalizeMethod.WRITE,, StagedDataFinalizeMethod.APPEND] as well to avoid repetition?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using compact_incomplete on a library with dynamic schema with a named index can result in an unreadable index
2 participants