2113 using compact incomplete on a library with dynamic schema with a named index can result in an unreadable index #2116
base: master
auto new_df_field_index_count = new_df_descriptor.index().type() == IndexDescriptor::Type::EMPTY ? 0 : new_df_descriptor.index().field_count();

// If either index is empty, we consider them to match
if (df_in_store_index_field_count == 0 || new_df_field_index_count == 0) {
This check accommodates the existing behavior around empty DataFrames and Series, both of which have essentially empty indexes, even though for a Series the index type is RowCount, I think.
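To make the rule in this comment concrete, here is a minimal Python sketch of the comparison. The function names are hypothetical, and the equality fallback for two non-empty indexes is an assumption, since the diff only shows the empty-index branch. It is not the C++ implementation.

```python
def index_field_count(index_type: str, field_count: int) -> int:
    """Mirror the ternary in the diff: an EMPTY index counts as zero fields."""
    return 0 if index_type == "EMPTY" else field_count


def index_field_counts_match(stored_count: int, new_count: int) -> bool:
    """Hypothetical helper: decide whether two index field counts are compatible."""
    # If either index is empty, we consider them to match (as in the diff).
    if stored_count == 0 or new_count == 0:
        return True
    # Assumption: otherwise the counts must be equal. This branch is not
    # visible in the diff fragment.
    return stored_count == new_count
```

With this shape, an empty DataFrame staged against a symbol with a one-field timestamp index is accepted, while a one-field index against a two-field index is rejected.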
The benchmarks are reporting ~30% performance degradation on the FinalizeStagedData benchmarks:
I think that this is due to the new check over all of the segments to make sure that the index names are the same, which was not done before. IGNORE THIS: The latest commit fixes this - 300ae92
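For readers wondering why such a check could show up in the benchmarks, here is a rough sketch of a per-segment index name comparison. The function and its inputs are hypothetical and only illustrate that the cost grows linearly with the number of staged segments. It does not reflect how ArcticDB performs or optimizes the check.

```python
from typing import Iterable, Optional


def first_index_name_mismatch(expected_name: Optional[str],
                              staged_index_names: Iterable[Optional[str]]) -> Optional[int]:
    """Return the position of the first staged segment whose index name
    differs from expected_name, or None if every segment agrees."""
    # One comparison per staged segment: O(number of segments).
    for position, name in enumerate(staged_index_names):
        if name != expected_name:
            return position
    return None
```

A caller could raise a schema error when the result is not None, naming the offending segment.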
I think it's worth adding a test for
…ate docs to reflect new behavior
300ae92
to
43d141c
@@ -650,7 +650,7 @@ size_t SegmentInMemoryImpl::num_bytes() const {
 void SegmentInMemoryImpl::sort(const std::string& column_name) {
     init_column_map();
     auto idx = column_index(std::string_view(column_name));
-    user_input::check<ErrorCode::E_COLUMN_NOT_FOUND>(static_cast<bool>(idx), "Column {} not found in sort", column_name);
+    schema::check<ErrorCode::E_COLUMN_DOESNT_EXIST>(static_cast<bool>(idx), "Column {} not found in sort", column_name);
Note: I've changed this so it is more consistent with the other similar exceptions
df_0.index.name = "date"
df_1 = pd.DataFrame({"col_0": [1]}, index=pd.date_range("2024-01-02", periods=1))
lib.write(sym, df_0)
lib.append(sym, df_1, incomplete=True)
I think append(...incomplete=True) is the same as write(...incomplete=True). Let's keep the test. It's a questionable design decision that we allow it.
@pytest.mark.parametrize("delete_staged_data_on_failure", [True, False])
def test_sort_and_finalize_staged_data_write_dynamic_schema_named_index(
nit: Can't this be parametrized by mode=[StagedDataFinalizeMethod.WRITE, StagedDataFinalizeMethod.APPEND] as well, to avoid repetition?
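The suggestion could look roughly like the sketch below, which stacks a second `pytest.mark.parametrize` so that one test body covers both finalize modes. The test name and body are hypothetical, and the enum is a local stand-in for ArcticDB's `StagedDataFinalizeMethod` so the snippet runs without a library fixture.

```python
from enum import Enum, auto

import pytest


class StagedDataFinalizeMethod(Enum):
    # Local stand-in for ArcticDB's enum (assumption: only the two members
    # named in the review comment are needed here).
    WRITE = auto()
    APPEND = auto()


@pytest.mark.parametrize("delete_staged_data_on_failure", [True, False])
@pytest.mark.parametrize("mode", [StagedDataFinalizeMethod.WRITE, StagedDataFinalizeMethod.APPEND])
def test_finalize_named_index_sketch(mode, delete_staged_data_on_failure):
    # Hypothetical body: the real test would stage data and finalize it
    # using `mode`; here we only confirm the parameters arrive as expected.
    assert isinstance(mode, StagedDataFinalizeMethod)
    assert isinstance(delete_staged_data_on_failure, bool)
```

Stacked decorators make pytest generate the cross product, so this yields four test cases instead of two near-identical test functions.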
Reference Issues/PRs
Fixes #2113
What does this implement or fix?
Any other comments?
Checklist
Checklist for code changes...