GH-40592: [C++][Parquet] Implement SizeStatistics #40594

Open
wants to merge 6 commits into
base: main

wgtmac
Member

@wgtmac wgtmac commented Mar 16, 2024

Rationale for this change

Parquet format 2.10.0 introduced SizeStatistics. parquet-mr has already implemented it: apache/parquet-java#1177. Now it is time for parquet-cpp to pick up the ball.

What changes are included in this PR?

Implement reading and writing size statistics for parquet-cpp.

Are these changes tested?

Yes, a bunch of test cases have been added.

Are there any user-facing changes?

Yes, now parquet users are able to read and write size statistics.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Mar 17, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 19, 2024
@wgtmac wgtmac marked this pull request as ready for review April 5, 2024 15:39
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Apr 10, 2024
@@ -1631,6 +1694,12 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
      page_statistics_->UpdateSpaced(values, valid_bits, valid_bits_offset,
                                     num_spaced_values, num_values, num_nulls);
    }
    if constexpr (std::is_same_v<T, ByteArray>) {
      if (page_size_stats_builder_ != nullptr) {
        page_size_stats_builder_->WriteValuesSpaced(values, valid_bits, valid_bits_offset,
Contributor

I wonder if we could somehow gather this at a lower level (based on buffer size of written values instead of having to handle Spaced values separately)

Member Author

I have thought about this. We have different interfaces to write values of BYTE_ARRAY type:

  • dense ByteArray values
  • spaced ByteArray values
  • arrow::Array of String, Binary, and their large variants
  • dictionary-encoded arrow::Array

These interfaces then directly put values into encoders. So here is the last chance to catch BYTE_ARRAY values before encoding.
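
As a rough illustration of the dense and spaced cases listed above, the sketch below sums unencoded BYTE_ARRAY lengths without any Arrow or Parquet dependency. `ByteArrayView`, `SumDenseBytes`, and `SumSpacedBytes` are hypothetical names for this example, not the API in this PR, and the validity bitmap is assumed to be packed least-significant-bit first.

```cpp
#include <cstdint>

// Hypothetical stand-in for parquet::ByteArray: a length plus a pointer.
struct ByteArrayView {
  uint32_t len;
  const uint8_t* ptr;
};

// Dense input: every slot holds a real value.
inline int64_t SumDenseBytes(const ByteArrayView* values, int64_t num_values) {
  int64_t total = 0;
  for (int64_t i = 0; i < num_values; ++i) {
    total += values[i].len;
  }
  return total;
}

// Spaced input: slot i is counted only if bit (valid_bits_offset + i) is set.
// Bits are assumed to be packed least-significant-bit first.
inline int64_t SumSpacedBytes(const ByteArrayView* values, const uint8_t* valid_bits,
                              int64_t valid_bits_offset, int64_t num_spaced_values) {
  int64_t total = 0;
  for (int64_t i = 0; i < num_spaced_values; ++i) {
    const int64_t bit = valid_bits_offset + i;
    const bool is_valid = (valid_bits[bit / 8] >> (bit % 8)) & 1;
    if (is_valid) {
      total += values[i].len;
    }
  }
  return total;
}
```

Both helpers only read the length field, so the cost is one pass over the slots regardless of how large the values are.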

  page_statistics_->Update(*referenced_dictionary, /*update_counts=*/false);
}
if (page_size_stats_builder_) {
  page_size_stats_builder_->WriteValues(*referenced_dictionary);
Contributor

Is this right? Why are we writing values in the dictionary for page size stats? Maybe add a comment or a clearer name?

Member Author

Yes, this is right. The binary values are passed in as an arrow::DictionaryArray, with the encoded indices in an arrow::Int32Array. Here we need to restore the referenced values in the dictionary array to build the page stats and size stats precisely.
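
To make the dictionary case concrete, here is a minimal sketch, assuming the dictionary is a plain vector of strings and the indices are int32 values where a negative index marks a null. That null convention is an assumption for this example only (Arrow uses a validity bitmap), and `UnencodedBytesFromDictionary` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sum the unencoded size of the values that the indices actually reference.
// A negative index stands in for a null slot (assumption for this sketch).
inline int64_t UnencodedBytesFromDictionary(const std::vector<std::string>& dictionary,
                                            const std::vector<int32_t>& indices) {
  int64_t total = 0;
  for (int32_t index : indices) {
    if (index < 0) continue;  // null: contributes no bytes
    total += static_cast<int64_t>(
        dictionary.at(static_cast<std::size_t>(index)).size());
  }
  return total;
}
```

A value referenced several times is counted once per reference, and dictionary entries that no index points to are not counted at all.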

/// \param[in] valid_bits pointer to bitmap representing if values are non-null.
/// \param[in] valid_bits_offset offset into valid_bits where the slice of data begins.
/// \param[in] num_spaced_values length of values in values/valid_bits to inspect.
void WriteValuesSpaced(const ByteArray* values, const uint8_t* valid_bits,
Contributor

As commented above, I wonder if it is possible not to intertwine spaced values (and the Array option below) into this interface.

Member Author

The logic is required anyway. Perhaps we can provide only the dense interface here and move the handling of nulls and arrow arrays to the caller?

Contributor

@emkornfield emkornfield left a comment

A few high level questions/suggestions.

@wgtmac wgtmac force-pushed the size_stats branch 2 times, most recently from 8661324 to 90caf32 Compare July 10, 2024 15:32
@wgtmac
Member Author

wgtmac commented Jul 10, 2024

Finally this PR is complete on my side. Please take a look when you have time. Thanks! @emkornfield @pitrou @mapleFU

Member

@pitrou pitrou left a comment

Sorry for the delay @wgtmac . This is a first partial review, I'll go over the rest once these comments are answered or addressed :-)

/// \param size_statistics pointer to the thrift SizeStatistics structure.
/// \param descr column descriptor for the column.
/// \returns SizeStatistics object. Its lifetime is not bound to the input.
static std::unique_ptr<SizeStatistics> Make(const void* size_statistics,
Member

If you're using the pimpl idiom, then you should just return a SizeStatistics here, since all the implementation is already inside a std::unique_ptr.

Member

Conversely, you could also remove the pimpl idiom and return a subclass here instead. This is better if you want to be able to pass an optionally null pointer, or store a shared_ptr at some point.

Member Author

I was following the pimpl idiom of class FileMetaData:

/// \brief Create a FileMetaData from a serialized thrift message.
static std::shared_ptr<FileMetaData> Make(
    const void* serialized_metadata, uint32_t* inout_metadata_len,
    const ReaderProperties& properties = default_reader_properties(),
    std::shared_ptr<InternalFileDecryptor> file_decryptor = NULLPTR);

Returning a SizeStatistics instead of std::unique_ptr<SizeStatistics> makes it impossible to store it in a smart pointer, which is contrary to the convention in this codebase.

Returning a subclass requires implementing virtual functions, which will be called frequently at every batch. This is something I want to avoid.
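
For readers less familiar with the trade-off under discussion, the following sketch shows the two factory shapes side by side on a pimpl-style class. `StatsSketch`, `Make`, `MakeUnique`, and `total` are hypothetical names for illustration and are not the SizeStatistics class from this PR.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical pimpl-style class; not the SizeStatistics class from this PR.
class StatsSketch {
 public:
  // Option A: return by value. The object is cheap to move because its only
  // member is the pointer to the implementation.
  static StatsSketch Make(std::vector<int64_t> histogram) {
    return StatsSketch(std::make_unique<Impl>(Impl{std::move(histogram)}));
  }

  // Option B: return a smart pointer, in the style of FileMetaData::Make.
  static std::unique_ptr<StatsSketch> MakeUnique(std::vector<int64_t> histogram) {
    return std::unique_ptr<StatsSketch>(
        new StatsSketch(std::make_unique<Impl>(Impl{std::move(histogram)})));
  }

  // Non-virtual accessor: a plain call that forwards to the implementation.
  int64_t total() const {
    int64_t sum = 0;
    for (int64_t count : impl_->histogram) sum += count;
    return sum;
  }

 private:
  struct Impl {
    std::vector<int64_t> histogram;
  };
  explicit StatsSketch(std::unique_ptr<Impl> impl) : impl_(std::move(impl)) {}
  std::unique_ptr<Impl> impl_;
};
```

In both options the accessor is a regular member function, so neither shape requires virtual dispatch.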

Comment on lines 121 to 129
/// \brief Add repeated repetition level to the histogram.
/// \param num_levels number of repetition levels to add.
/// \param rep_level repeated repetition level value.
void AddRepetitionLevel(int64_t num_levels, int16_t rep_level);

/// \brief Add repeated definition level to the histogram.
/// \param num_levels number of definition levels to add.
/// \param def_level repeated definition level value.
void AddDefinitionLevel(int64_t num_levels, int16_t def_level);
Member

Are these two really useful?

Member

Ah, sorry! The name misled me. Can't we name them AddDefinitionLevels and AddRepetitionLevels? Otherwise, these look like they are adding a single level.

Member

Also, not sure why they're taking the explicit rep_level and def_level values. AFAICT, these are only useful to append levels equal to 0.

Member Author

You're right. They are only used to append the level value 0. It might look stranger if I special-cased a new function like AppendDefLevelZero(num_levels). They are also convenient to use in the unit tests, so I am inclined to keep them.
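
The difference between adding an array of levels and adding one repeated level can be sketched as follows. `LevelHistogramSketch`, `AddLevels`, and `AddRepeatedLevel` are hypothetical names chosen for this example; the builder in this PR may be structured differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical histogram of level values in the range [0, max_level].
class LevelHistogramSketch {
 public:
  explicit LevelHistogramSketch(int16_t max_level)
      : counts_(static_cast<std::size_t>(max_level) + 1, 0) {}

  // Add one entry per level in the array.
  void AddLevels(const int16_t* levels, int64_t num_levels) {
    for (int64_t i = 0; i < num_levels; ++i) {
      ++counts_.at(static_cast<std::size_t>(levels[i]));
    }
  }

  // Add `num_levels` copies of the same level value in one step, e.g. a run
  // of zeros, without materializing an array of levels.
  void AddRepeatedLevel(int64_t num_levels, int16_t level) {
    counts_.at(static_cast<std::size_t>(level)) += num_levels;
  }

  const std::vector<int64_t>& counts() const { return counts_; }

 private:
  std::vector<int64_t> counts_;
};
```

The repeated form is a single addition, which is why it is handy both for runs of level 0 and for setting up expected histograms in tests.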

Comment on lines +136 to +144
void AddValuesSpaced(const ByteArray* values, const uint8_t* valid_bits,
int64_t valid_bits_offset, int64_t num_spaced_values);

/// \brief Add dense BYTE_ARRAY values.
/// \param values pointer to values of BYTE_ARRAY type.
/// \param num_values length of values.
void AddValues(const ByteArray* values, int64_t num_values);

/// \brief Add BYTE_ARRAY values in the arrow array.
void AddValues(const ::arrow::Array& values);
Member

Wouldn't it be more logical for the BYTE_ARRAY encoders to accumulate the unencoded_byte_array_data_bytes, instead of visiting the input data again here?

Member Author

There are two cases where BYTE_ARRAY encoders do not work:

  1. When dictionary encoding is enabled.
  2. When the input data is in an arrow::DictionaryArray.

@emkornfield
Contributor

Going to do another pass through. The CI failure looks like a formatting issue.

/// Finalize unencoded_byte_array_data_bytes and make sure page sizes match.
if (offset_index_.page_locations.size() ==
    offset_index_.unencoded_byte_array_data_bytes.size()) {
  offset_index_.__isset.unencoded_byte_array_data_bytes = true;
Contributor

Is the check above shorthand for the case where nothing is provided? Do we expect only two states: the sizes either always match, or never match once a page has been added?

Member Author

Yes, it should always match if size stats are enabled. Otherwise, we expect the list to be empty.
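
The two expected states can be written out as a small check. This is only a sketch: `OffsetIndexSketch` and `FinalizeSketch` are hypothetical stand-ins for the thrift OffsetIndex and the finalize step, with page locations simplified to plain integers and the `__isset` flag modelled as a bool.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the thrift OffsetIndex; field names mirror the
// snippet above but the struct itself is only for illustration.
struct OffsetIndexSketch {
  std::vector<int64_t> page_locations;  // one entry per page (simplified)
  std::vector<int64_t> unencoded_byte_array_data_bytes;
  bool has_unencoded_byte_array_data_bytes = false;
};

// Returns true if the index is in one of the two expected states and marks
// the optional field as set only when every page has a size entry.
inline bool FinalizeSketch(OffsetIndexSketch* index) {
  const auto num_pages = index->page_locations.size();
  const auto num_sizes = index->unencoded_byte_array_data_bytes.size();
  if (num_sizes == num_pages) {
    index->has_unencoded_byte_array_data_bytes = true;
    return true;
  }
  // Size statistics disabled: no sizes should have been recorded at all.
  return num_sizes == 0;
}
```

Any other combination, such as sizes recorded for only some pages, is reported as unexpected by the return value.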

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 6, 2024
Contributor

@emkornfield emkornfield left a comment

LGTM. I'm OK with this as long as @pitrou is. Thank you for driving this.

@wgtmac
Member Author

wgtmac commented Aug 7, 2024

@emkornfield @mapleFU Thanks for the feedback! I haven't addressed all comments from @pitrou yet. Will let you know once ready for review again.

@wgtmac
Member Author

wgtmac commented Aug 22, 2024

This is ready for review again. Thanks in advance! @emkornfield @pitrou @mapleFU

5 participants