2113 using compact incomplete on a library with dynamic schema with a named index can result in an unreadable index #2116

G-D-Petrov · 2025-01-13T12:17:38Z

Reference Issues/PRs

Fixes #2113

What does this implement or fix?

Any other comments?

Checklist

Checklist for code changes...

Have you updated the relevant docstrings, documentation and copyright notice?
Is this contribution tested against all ArcticDB's features?
Do all exceptions introduced raise appropriate error messages?
Are API changes highlighted in the PR description?
Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

G-D-Petrov · 2025-01-14T09:04:38Z

cpp/arcticdb/version/schema_checks.hpp

+    auto new_df_field_index_count = new_df_descriptor.index().type() == IndexDescriptor::Type::EMPTY ? 0 : new_df_descriptor.index().field_count();
+
+    // If either index is empty, we consider them to match
+    if (df_in_store_index_field_count == 0 || new_df_field_index_count == 0) {


This check accommodates the existing behavior around empty DataFrames and Series, both of which have essentially empty indexes, even though for a Series the index type is RowCount, I think.
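The rule in the diff above can be modelled in a few lines of Python. This is only a sketch of the logic, not the C++ in `schema_checks.hpp`: the `IndexType` and `IndexDescriptor` classes are hypothetical stand-ins, and the final `stored == incoming` comparison is an assumption about what happens when neither index is empty, since the diff excerpt does not show it.

```python
from dataclasses import dataclass
from enum import Enum, auto


class IndexType(Enum):
    # Hypothetical stand-in for IndexDescriptor::Type in the C++ code.
    EMPTY = auto()
    ROWCOUNT = auto()
    TIMESTAMP = auto()


@dataclass(frozen=True)
class IndexDescriptor:
    # Hypothetical stand-in; only the two attributes used by the check.
    type: IndexType
    field_count: int


def effective_field_count(index: IndexDescriptor) -> int:
    """Treat an EMPTY index as having zero fields, whatever count it stores."""
    return 0 if index.type is IndexType.EMPTY else index.field_count


def index_field_counts_match(in_store: IndexDescriptor, new: IndexDescriptor) -> bool:
    stored = effective_field_count(in_store)
    incoming = effective_field_count(new)
    # If either index is empty, the two are considered to match.
    if stored == 0 or incoming == 0:
        return True
    # Assumption: otherwise the field counts must be equal.
    return stored == incoming
```

Under this model, a row-count index with zero fields (assumed here to be what a Series produces) matches any stored index, which is the accommodation the comment describes.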

G-D-Petrov · 2025-01-14T09:08:57Z

The FinalizeStagedData benchmarks report a regression of roughly 35% in run time and 45-60% in peak memory:

| Change | Before [`d71a0bb`] <v5.2.0rc0~1> | After [`c1b389e`] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| + | 904M | 1.45G | 1.6 | finalize_staged_data.FinalizeStagedData.peakmem_finalize_staged_data(1000) |
| + | 1.88G | 2.73G | 1.45 | finalize_staged_data.FinalizeStagedData.peakmem_finalize_staged_data(2000) |
| + | 1.71±0s | 2.33±0s | 1.36 | finalize_staged_data.FinalizeStagedData.time_finalize_staged_data(1000) |
| + | 3.49±0s | 4.66±0s | 1.34 | finalize_staged_data.FinalizeStagedData.time_finalize_staged_data(2000) |

I think that this is due to the new check over all of the segments to make sure that the index names are the same, which was not done before.

IGNORE THIS: The latest commit fixes this - 300ae92
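To make the suspected cost concrete, here is a minimal Python sketch of a check that compares every staged segment's index name against the expected one. It is not ArcticDB's implementation: `first_index_name_mismatch` and its arguments are hypothetical, and the point is only that such a check visits each segment once, so its cost grows with the number of staged segments.

```python
from typing import Iterable, Optional, Tuple


def first_index_name_mismatch(
    expected_name: str, staged_index_names: Iterable[str]
) -> Optional[Tuple[int, str]]:
    """Return (position, name) of the first staged segment whose index name
    differs from expected_name, or None if every segment agrees.

    Hypothetical helper: one comparison per staged segment.
    """
    for position, name in enumerate(staged_index_names):
        if name != expected_name:
            return position, name
    return None
```

For example, `first_index_name_mismatch("date", ["date", "index"])` reports the second segment, while a list of identical names yields `None`.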

cpp/arcticdb/version/schema_checks.hpp

vasil-pashov · 2025-01-15T10:49:30Z

I think it's worth adding a test for sort_and_finalize_staged_data similar to the one for finalize_staged_data because the codepaths are slightly different


G-D-Petrov · 2025-01-15T14:33:46Z

cpp/arcticdb/column_store/memory_segment_impl.cpp

@@ -650,7 +650,7 @@ size_t SegmentInMemoryImpl::num_bytes() const {
 void SegmentInMemoryImpl::sort(const std::string& column_name) {
    init_column_map();
    auto idx = column_index(std::string_view(column_name));
-    user_input::check<ErrorCode::E_COLUMN_NOT_FOUND>(static_cast<bool>(idx), "Column {} not found in sort", column_name);
+    schema::check<ErrorCode::E_COLUMN_DOESNT_EXIST>(static_cast<bool>(idx), "Column {} not found in sort", column_name);


Note: I've changed this so it is more consistent with the other similar exceptions

vasil-pashov · 2025-01-15T15:47:12Z

python/tests/unit/arcticdb/version_store/test_parallel.py

+    df_0.index.name = "date"
+    df_1 = pd.DataFrame({"col_0": [1]}, index=pd.date_range("2024-01-02", periods=1))
+    lib.write(sym, df_0)
+    lib.append(sym, df_1, incomplete=True)


I think append(...incomplete=True) is the same as write(...incomplete=True). Let's keep the test. It's a questionable design decision that we allow it.

vasil-pashov · 2025-01-15T15:49:22Z

python/tests/unit/arcticdb/version_store/test_sort_merge.py

+
+
+@pytest.mark.parametrize("delete_staged_data_on_failure", [True, False])
+def test_sort_and_finalize_staged_data_write_dynamic_schema_named_index(


nit: Can't this be parametrized by mode=[StagedDataFinalizeMethod.WRITE, StagedDataFinalizeMethod.APPEND] as well to avoid repetition?
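A sketch of what the suggested parametrization could look like, using stacked `pytest.mark.parametrize` decorators so that pytest generates one case per combination of `mode` and `delete_staged_data_on_failure`. The `StagedDataFinalizeMethod` enum below is a local stand-in so the snippet runs without ArcticDB, the test name is illustrative, and the body is a placeholder rather than the PR's actual assertions.

```python
from enum import Enum, auto

import pytest


class StagedDataFinalizeMethod(Enum):
    # Local stand-in for ArcticDB's enum, so this sketch is self-contained.
    WRITE = auto()
    APPEND = auto()


@pytest.mark.parametrize(
    "mode", [StagedDataFinalizeMethod.WRITE, StagedDataFinalizeMethod.APPEND]
)
@pytest.mark.parametrize("delete_staged_data_on_failure", [True, False])
def test_sort_and_finalize_staged_data_dynamic_schema_named_index(
    mode, delete_staged_data_on_failure
):
    # Placeholder body: the real test would stage data and finalize it with
    # the given mode; here we only confirm the parameters arrive as expected.
    assert isinstance(mode, StagedDataFinalizeMethod)
    assert isinstance(delete_staged_data_on_failure, bool)
```

Stacking the decorators yields four test cases from a single function, which is the repetition the comment is trying to avoid.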

G-D-Petrov requested review from alexowens90, willdealtry and poodlewars as code owners January 13, 2025 12:17


G-D-Petrov added 7 commits January 15, 2025 16:32

Add named index tests (e83479d)

Add index name matching checks to schema validation (53c75a1)

Update index name matching logic and adjust StreamDescriptorMismatch imports (aec5fdb)

Refactor index name matching logic to improve readability and maintainability (654426e)

Check the index names in finalize staged data on demand (1c20e99)

Move schema checks functions to a cpp file (5319604)

Add sort_and_merge_tests, update exception to be more consistent, update docs to reflect new behavior (43d141c)

G-D-Petrov force-pushed the 2113-using-compact_incomplete-on-a-library-with-dynamic-schema-with-a-named-index-can-result-in-an-unreadable-index branch from 300ae92 to 43d141c on January 15, 2025 14:33

vasil-pashov approved these changes Jan 15, 2025
