
[Feature-WIP](iceberg-writer) Implements iceberg partition transform. #36289

Merged (21 commits) on Jun 22, 2024

Conversation

@ghkang98 (Contributor) commented Jun 14, 2024

#31442

Adds Iceberg writer support so Doris can write directly into the lake:

  1. Support INSERT INTO for Iceberg tables by appending HDFS files
  2. Implement Iceberg partition routing through partitionTransform
    2.1) Serialize the partition spec and schema to JSON on the FE side, then deserialize them on the BE side to obtain the Iceberg table's schema and partition information
    2.2) Implement Iceberg's Identity, Bucket, Year/Month/Day and other partition strategies through partitionTransform and a template class
  3. Manage transactions through IcebergTransaction
    3.1) After the BE side finishes writing files, it reports CommitData to the FE at partition granularity
    3.2) After receiving the CommitData, the FE commits metadata to Iceberg in IcebergTransaction

Future work

  • Add unit tests for the partition transform functions.
  • Implement the partition transform functions with the exchange sink enabled.
  • Handle the bigint type, which the partition transform functions currently omit.

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the documentation has been moved to doris-website.
See Doris Document.

Contributor

sh-checker report

To get the full details, please check in the job output.

shellcheck errors

'shellcheck ' returned error 1 finding the following syntactical issues:

----------

In build_auditloader.sh line 5:
ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
^--^ SC2034 (warning): ROOT appears unused. Verify use (or export if used externally).


In build_auditloader.sh line 14:
mkdir -p ${AUDITLOADER_DIR}
         ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean: 
mkdir -p "${AUDITLOADER_DIR}"


In build_auditloader.sh line 29:
ls -l $AUDITLOADER_DIR
      ^--------------^ SC2086 (info): Double quote to prevent globbing and word splitting.
      ^--------------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
ls -l "${AUDITLOADER_DIR}"


In build_auditloader.sh line 30:
cp -a $AUDITLOADER_DIR "${DORIS_OUTPUT}/output/"
      ^--------------^ SC2086 (info): Double quote to prevent globbing and word splitting.
      ^--------------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
cp -a "${AUDITLOADER_DIR}" "${DORIS_OUTPUT}/output/"

For more information:
  https://www.shellcheck.net/wiki/SC2034 -- ROOT appears unused. Verify use (...
  https://www.shellcheck.net/wiki/SC2086 -- Double quote to prevent globbing ...
  https://www.shellcheck.net/wiki/SC2250 -- Prefer putting braces around vari...
----------

You can address the above issues in one of three ways:
1. Manually correct the issue in the offending shell script;
2. Disable specific issues by adding the comment:
  # shellcheck disable=NNNN
above the line that contains the issue, where NNNN is the error code;
3. Add '-e NNNN' to the SHELLCHECK_OPTS setting in your .yml action file.



shfmt errors
'shfmt ' found no issues.


@github-actions github-actions bot left a comment


clang-tidy made some suggestions

be/src/util/bit_util.h (outdated; resolved)
be/src/util/bit_util.h (outdated; resolved)
be/src/vec/sink/writer/iceberg/partition_transformers.cpp (outdated; resolved)
const std::chrono::time_point<std::chrono::system_clock> PartitionColumnTransformUtils::EPOCH =
std::chrono::system_clock::from_time_t(0);

std::unique_ptr<PartitionColumnTransform> PartitionColumnTransforms::create(
Contributor

warning: function 'create' exceeds recommended size/complexity thresholds [readability-function-size]

std::unique_ptr<PartitionColumnTransform> PartitionColumnTransforms::create(
                                                                     ^
Additional context

be/src/vec/sink/writer/iceberg/partition_transformers.cpp:30: 158 lines including whitespace and comments (threshold 80)

std::unique_ptr<PartitionColumnTransform> PartitionColumnTransforms::create(
                                                                     ^

be/src/vec/sink/writer/iceberg/partition_transformers.h (outdated; resolved)
Int32* __restrict p_out = out_data.data();

while (p_in < end_in) {
Int64 long_value = static_cast<Int64>(*p_in);
Contributor

warning: use auto when initializing with a cast to avoid duplicating the type name [modernize-use-auto]

Suggested change:
- Int64 long_value = static_cast<Int64>(*p_in);
+ auto long_value = static_cast<Int64>(*p_in);

binary_cast<uint32_t, DateV2Value<DateV2ValueType>>(*(UInt32*)p_in);

int32_t days_from_unix_epoch = value.daynr() - 719528;
Int64 long_value = static_cast<Int64>(days_from_unix_epoch);
Contributor

warning: use auto when initializing with a cast to avoid duplicating the type name [modernize-use-auto]

Suggested change:
- Int64 long_value = static_cast<Int64>(days_from_unix_epoch);
+ auto long_value = static_cast<Int64>(days_from_unix_epoch);



@github-actions github-actions bot left a comment


clang-tidy made some suggestions

Comment on lines 24 to 25
namespace doris {
namespace vectorized {
Contributor

warning: nested namespaces can be concatenated [modernize-concat-nested-namespaces]

Suggested change:
- namespace doris {
- namespace vectorized {
+ namespace doris::vectorized {

be/src/vec/sink/writer/iceberg/partition_transformers.cpp:274:

- } // namespace vectorized
- } // namespace doris
+ } // namespace doris

const std::chrono::time_point<std::chrono::system_clock> PartitionColumnTransformUtils::EPOCH =
std::chrono::system_clock::from_time_t(0);

std::unique_ptr<PartitionColumnTransform> PartitionColumnTransforms::create(
Contributor

warning: function 'create' exceeds recommended size/complexity thresholds [readability-function-size]

std::unique_ptr<PartitionColumnTransform> PartitionColumnTransforms::create(
                                                                     ^
Additional context

be/src/vec/sink/writer/iceberg/partition_transformers.cpp:29: 158 lines including whitespace and comments (threshold 80)

std::unique_ptr<PartitionColumnTransform> PartitionColumnTransforms::create(
                                                                     ^


@github-actions github-actions bot left a comment


clang-tidy made some suggestions

value >>= 8;
if (value == 0 && value_to_save >= 0) { break;
}
if (value == -1 && value_to_save < 0) { break;
Contributor

warning: statement should be inside braces [readability-braces-around-statements]

Suggested change:
- if (value == -1 && value_to_save < 0) { break;
+ if (value == -1 && value_to_save < 0) { break;
+ }

Comment on lines 24 to 25
namespace doris::vectorized {

Contributor

warning: nested namespaces can be concatenated [modernize-concat-nested-namespaces]

Suggested change
namespace doris::vectorized {
namespace doris::vectorized {

be/src/vec/sink/writer/iceberg/partition_transformers.cpp:274:

- } // namespace vectorized
- } // namespace doris
+ } // namespace doris

std::chrono::system_clock::from_time_t(0);

std::unique_ptr<PartitionColumnTransform> PartitionColumnTransforms::create(
const doris::iceberg::PartitionField& field, const TypeDescriptor& source_type) {
Contributor

warning: function 'create' exceeds recommended size/complexity thresholds [readability-function-size]

std::unique_ptr<PartitionColumnTransform> PartitionColumnTransforms::create(
                                                                     ^
Additional context

be/src/vec/sink/writer/iceberg/partition_transformers.cpp:29: 158 lines including whitespace and comments (threshold 80)

std::unique_ptr<PartitionColumnTransform> PartitionColumnTransforms::create(
                                                                     ^


private:
static const std::chrono::time_point<std::chrono::system_clock> EPOCH;
PartitionColumnTransformUtils() = default;
Contributor

warning: use '= default' to define a trivial default constructor [modernize-use-equals-default]

Suggested change
PartitionColumnTransformUtils() = default;
PartitionColumnTransformUtils() = default;


@github-actions github-actions bot left a comment


clang-tidy made some suggestions

const std::chrono::time_point<std::chrono::system_clock> PartitionColumnTransformUtils::EPOCH =
std::chrono::system_clock::from_time_t(0);

std::unique_ptr<PartitionColumnTransform> PartitionColumnTransforms::create(
Contributor

warning: function 'create' exceeds recommended size/complexity thresholds [readability-function-size]

std::unique_ptr<PartitionColumnTransform> PartitionColumnTransforms::create(
                                                                     ^
Additional context

be/src/vec/sink/writer/iceberg/partition_transformers.cpp:28: 158 lines including whitespace and comments (threshold 80)

std::unique_ptr<PartitionColumnTransform> PartitionColumnTransforms::create(
                                                                     ^

@morningman
Contributor

run buildall

1 similar comment
@ghkang98
Contributor Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.35% (9001/24760)
Line Coverage: 27.90% (73718/264178)
Region Coverage: 27.43% (38305/139660)
Branch Coverage: 24.12% (19520/80938)
Coverage Report: http://coverage.selectdb-in.cc/coverage/3b37c014d8e52a5343752838a3282997e95c9d55_3b37c014d8e52a5343752838a3282997e95c9d55/report/index.html

@kaka11chen kaka11chen changed the title Iceberg write [Feature](iceberg-writer) Implements iceberg partition transform and insert overwrite functionalities. Jun 16, 2024

import java.util.Objects;

public class SimpleTableInfo {
Contributor

  1. Generally, when you override the equals method, you also need to override hashCode.
  2. If you just want to express a database and table, you can refactor this class as DatabaseTableName.

Contributor Author

DatabaseTableName is just an inner class inside HMSTransaction. The SimpleTableInfo class here is built as a globally usable class that records basic information about an Iceberg database and table.


package org.apache.doris.datasource.statistics;

public class CommonStatistics {
Contributor

Could this class be merged with HivePartitionStatistics?

Contributor Author

CommonStatistics is intended to eventually abstract the information in HivePartitionStatistics. We cannot do it all at once and need to refine it step by step.


public static FileFormat getFileFormat(Table icebergTable) {
Map<String, String> properties = icebergTable.properties();
String fileFormatName = properties.getOrDefault(TableProperties.DEFAULT_FILE_FORMAT, "parquet");
Contributor

Why should we delete WRITE_FORMAT and DEFAULT_FILE_FORMAT here?

IcebergTransaction transaction = (IcebergTransaction) transactionManager.getTransaction(txnId);
loadedRows = transaction.getUpdateCnt();
Contributor

Why is the update count not showing now?

Contributor Author

OK, I will add the update count statistic.

public void beginInsert(String dbName, String tbName) {
Table icebergTable = ops.getCatalog().loadTable(TableIdentifier.of(dbName, tbName));
transaction = icebergTable.newTransaction();
public void pendingCommit(SimpleTableInfo tableInfo) {
Contributor

If possible, we should keep the same function names for all table insertions, including Hive, Iceberg, and any Paimon, Hudi, etc. that may be supported later.

Contributor Author

This can be unified with Hive, but I think pendingCommit, preCommit, and commit are more suitable names, and Hive can also move in that direction.

@@ -173,4 +177,7 @@ public void dropTable(DropTableStmt stmt) throws DdlException {
catalog.dropTable(TableIdentifier.of(dbName, tableName));
db.setUnInitialized(true);
}

Contributor

Unnecessary modifications

Contributor Author

I accidentally added a blank line while editing.

transaction = icebergTable.newTransaction();
public void pendingCommit(SimpleTableInfo tableInfo) {
this.tableInfo = tableInfo;
this.transaction = getNativeTable(tableInfo).newTransaction();
Contributor

The cached table cannot be used here, because it is not necessarily the same as the table being written.

return FileContent.POSITION_DELETES;
private void updateManifestAfterInsert(TUpdateMode updateMode) {

Table table = getNativeTable(tableInfo);
Contributor

Transactions are isolated, so cached tables cannot be used.

Contributor Author

First, getNativeTable fetches the table through IcebergUtil, which is a utility class not limited to the current scenario.
Second, we also need the latest information about the table here.

convertToFileContent(data.getFileContent()),
data.isSetReferencedDataFiles() ? Optional.of(data.getReferencedDataFiles()) : Optional.empty()
));
//create start the iceberg transaction
Contributor

Suggested change:
- //create start the iceberg transaction
+ // create and start the iceberg transaction

List<WriteResult> pendingResults = Lists.newArrayList(writeResult);

if (spec.isPartitioned()) {
LOG.info("{} {} table partition manifest ...", tableInfo, updateMode);
Contributor

Change these logs to debug level; only keep the necessary logs:

if (LOG.isDebugEnabled()) {
    LOG.debug(...);
}

this.partitionValues = convertPartitionValuesForNull(partitionValues);
this.content = content;
this.referencedDataFiles = referencedDataFiles;
private void partitionManifestUp(TUpdateMode updateMode, Table table, List<WriteResult> pendingResults) {
Contributor

Suggested change:
- private void partitionManifestUp(TUpdateMode updateMode, Table table, List<WriteResult> pendingResults) {
+ private void partitionManifestUpdate(TUpdateMode updateMode, Table table, List<WriteResult> pendingResults) {


package org.apache.doris.datasource.statistics;

public class CommonStatistics {
Contributor

Add a comment to this class.
What is CommonStatistics, and why is it needed? I only see it being used for Iceberg.

@ghkang98 ghkang98 changed the title [Feature](iceberg-writer) Implements iceberg partition transform and insert overwrite functionalities. [Feature](iceberg-writer) Implements iceberg partition transform Jun 17, 2024
@ghkang98
Contributor Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.35% (9000/24761)
Line Coverage: 27.90% (73734/264279)
Region Coverage: 27.42% (38300/139697)
Branch Coverage: 24.11% (19522/80970)
Coverage Report: http://coverage.selectdb-in.cc/coverage/7e94854534d2a58cb82bcb81807bc6209495a1a1_7e94854534d2a58cb82bcb81807bc6209495a1a1/report/index.html

@morningman
Contributor

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.36% (9003/24762)
Line Coverage: 27.91% (73758/264252)
Region Coverage: 27.42% (38307/139685)
Branch Coverage: 24.12% (19529/80964)
Coverage Report: http://coverage.selectdb-in.cc/coverage/b2f8851731b4903d99e9d791c33aadeabfbeaeda_b2f8851731b4903d99e9d791c33aadeabfbeaeda/report/index.html

@kaka11chen kaka11chen changed the title [Feature](iceberg-writer) Implements iceberg partition transform [Feature-WIP](iceberg-writer) Implements iceberg partition transform. Jun 18, 2024

@github-actions github-actions bot left a comment


clang-tidy made some suggestions

Int64* __restrict p_out = out_data.data();

while (p_in < end_in) {
Int64 long_value = static_cast<Int64>(*p_in);
Contributor

warning: use auto when initializing with a cast to avoid duplicating the type name [modernize-use-auto]

Suggested change:
- Int64 long_value = static_cast<Int64>(*p_in);
+ auto long_value = static_cast<Int64>(*p_in);

@ghkang98
Contributor Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.34% (9004/24775)
Line Coverage: 27.89% (73754/264408)
Region Coverage: 27.42% (38316/139735)
Branch Coverage: 24.12% (19527/80970)
Coverage Report: http://coverage.selectdb-in.cc/coverage/fa94498e1462064109809171f2a5b96e0efacbbd_fa94498e1462064109809171f2a5b96e0efacbbd/report/index.html

@morningman
Contributor

run compile

1 similar comment
@ghkang98
Contributor Author

run compile

@ghkang98
Copy link
Contributor Author

run buildall


@github-actions github-actions bot left a comment


clang-tidy made some suggestions


//2) get the input data from block
if (column_ptr->is_nullable()) {
const ColumnNullable* nullable_column =
Contributor

warning: use auto when initializing with a cast to avoid duplicating the type name [modernize-use-auto]

Suggested change:
- const ColumnNullable* nullable_column =
+ const auto* nullable_column =


Int32* __restrict p_out = out_data.data();

while (p_in < end_in) {
Int64 long_value = static_cast<Int64>(*p_in);
Contributor

warning: use auto when initializing with a cast to avoid duplicating the type name [modernize-use-auto]

Suggested change:
- Int64 long_value = static_cast<Int64>(*p_in);
+ auto long_value = static_cast<Int64>(*p_in);



@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.35% (9009/24786)
Line Coverage: 27.94% (73888/264483)
Region Coverage: 27.46% (38382/139757)
Branch Coverage: 24.16% (19567/80982)
Coverage Report: http://coverage.selectdb-in.cc/coverage/e34c4ce0e193b3b02cc2cec23269f2e7912c335f_e34c4ce0e193b3b02cc2cec23269f2e7912c335f/report/index.html

@morningman
Contributor

run buildall

@doris-robot

TPC-H: Total hot run time: 39878 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 33fc8fdbe139dff2e6cabc57491bfd3d0603c238, data reload: false

------ Round 1 ----------------------------------
q1	17669	4624	4328	4328
q2	2022	189	188	188
q3	10535	1087	1006	1006
q4	10221	797	810	797
q5	7474	2690	2618	2618
q6	226	136	136	136
q7	949	644	620	620
q8	9218	2090	2115	2090
q9	8768	6516	6518	6516
q10	8964	3673	3789	3673
q11	468	236	235	235
q12	437	236	230	230
q13	17785	2965	3005	2965
q14	289	224	227	224
q15	524	467	468	467
q16	518	371	378	371
q17	971	691	684	684
q18	8155	7453	7339	7339
q19	2628	1503	1537	1503
q20	651	316	325	316
q21	5040	3232	3287	3232
q22	386	340	343	340
Total cold run time: 113898 ms
Total hot run time: 39878 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4415	4278	4258	4258
q2	367	267	263	263
q3	3011	2784	2862	2784
q4	2081	1724	1780	1724
q5	5574	5635	5520	5520
q6	217	134	144	134
q7	2249	1840	1868	1840
q8	3315	3441	3431	3431
q9	8785	8761	8870	8761
q10	4084	3938	3844	3844
q11	592	482	490	482
q12	827	622	657	622
q13	16973	3178	3132	3132
q14	310	271	286	271
q15	530	482	490	482
q16	490	460	419	419
q17	1809	1516	1509	1509
q18	8078	7911	7911	7911
q19	1854	1541	1505	1505
q20	3075	1872	1901	1872
q21	5224	5082	4908	4908
q22	608	542	567	542
Total cold run time: 74468 ms
Total hot run time: 56214 ms

@doris-robot

TPC-DS: Total hot run time: 171291 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 33fc8fdbe139dff2e6cabc57491bfd3d0603c238, data reload: false

query1	921	388	380	380
query2	6471	2516	2297	2297
query3	6629	211	219	211
query4	18978	17342	17266	17266
query5	3582	489	481	481
query6	255	156	163	156
query7	4592	296	294	294
query8	309	303	290	290
query9	8494	2385	2337	2337
query10	567	322	276	276
query11	10583	10055	10060	10055
query12	117	88	86	86
query13	1642	358	368	358
query14	9983	7460	6341	6341
query15	229	188	189	188
query16	7754	273	267	267
query17	1933	541	513	513
query18	1941	271	270	270
query19	195	153	153	153
query20	95	87	85	85
query21	210	131	126	126
query22	4516	4119	4062	4062
query23	33874	33995	33770	33770
query24	10069	2909	2862	2862
query25	594	398	369	369
query26	709	155	156	155
query27	2267	324	328	324
query28	6190	2143	2153	2143
query29	881	650	609	609
query30	254	157	161	157
query31	1001	758	751	751
query32	94	52	53	52
query33	651	279	289	279
query34	914	493	488	488
query35	750	671	683	671
query36	1158	997	989	989
query37	140	72	74	72
query38	2944	2848	2839	2839
query39	909	818	809	809
query40	214	131	128	128
query41	62	54	52	52
query42	114	100	109	100
query43	587	538	553	538
query44	1135	735	752	735
query45	194	162	164	162
query46	1067	707	726	707
query47	1866	1783	1774	1774
query48	368	303	300	300
query49	846	400	412	400
query50	762	400	392	392
query51	6951	6762	6788	6762
query52	102	90	92	90
query53	367	299	291	291
query54	898	460	452	452
query55	76	72	74	72
query56	307	278	258	258
query57	1144	1051	1074	1051
query58	245	253	255	253
query59	3496	3310	3236	3236
query60	299	269	279	269
query61	94	88	95	88
query62	603	440	454	440
query63	318	280	284	280
query64	8496	2242	1829	1829
query65	3168	3117	3117	3117
query66	751	335	341	335
query67	15524	15040	15122	15040
query68	4535	550	536	536
query69	498	354	333	333
query70	1201	1059	1091	1059
query71	375	281	277	277
query72	7735	5299	2770	2770
query73	739	330	323	323
query74	5973	5488	5523	5488
query75	3343	2704	2653	2653
query76	2249	1014	908	908
query77	437	293	298	293
query78	10419	9882	9855	9855
query79	2879	522	525	522
query80	2135	472	457	457
query81	574	220	219	219
query82	1442	108	161	108
query83	289	176	173	173
query84	270	93	86	86
query85	1373	288	265	265
query86	478	331	326	326
query87	3243	3096	3098	3096
query88	4168	2466	2453	2453
query89	484	380	381	380
query90	1742	192	187	187
query91	128	101	100	100
query92	60	51	52	51
query93	3662	502	494	494
query94	1141	191	184	184
query95	407	318	310	310
query96	593	268	268	268
query97	3277	3098	3104	3098
query98	221	199	192	192
query99	1173	845	880	845
Total cold run time: 270692 ms
Total hot run time: 171291 ms

@doris-robot

ClickBench: Total hot run time: 30.32 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 33fc8fdbe139dff2e6cabc57491bfd3d0603c238, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.04	0.04
query3	0.23	0.04	0.05
query4	1.67	0.08	0.09
query5	0.50	0.49	0.49
query6	1.13	0.72	0.73
query7	0.02	0.02	0.01
query8	0.06	0.04	0.04
query9	0.56	0.51	0.48
query10	0.55	0.54	0.55
query11	0.14	0.11	0.11
query12	0.14	0.11	0.12
query13	0.59	0.59	0.61
query14	0.80	0.78	0.77
query15	0.85	0.83	0.80
query16	0.37	0.36	0.35
query17	1.00	0.97	0.96
query18	0.23	0.25	0.25
query19	1.82	1.83	1.75
query20	0.01	0.01	0.01
query21	15.40	0.66	0.65
query22	4.42	7.63	1.50
query23	18.32	1.37	1.25
query24	2.20	0.23	0.23
query25	0.16	0.10	0.09
query26	0.27	0.18	0.17
query27	0.09	0.08	0.08
query28	13.15	1.02	1.00
query29	13.00	3.38	3.38
query30	0.26	0.07	0.06
query31	2.86	0.39	0.38
query32	3.28	0.47	0.47
query33	2.81	2.93	2.85
query34	17.05	4.47	4.46
query35	4.48	4.54	4.55
query36	0.65	0.46	0.49
query37	0.20	0.16	0.15
query38	0.15	0.15	0.15
query39	0.05	0.03	0.04
query40	0.18	0.14	0.13
query41	0.10	0.06	0.04
query42	0.06	0.04	0.04
query43	0.04	0.04	0.04
Total cold run time: 109.96 s
Total hot run time: 30.32 s


@morningman morningman left a comment


LGTM

@morningman morningman merged commit d8d9f0a into apache:master Jun 22, 2024
25 of 29 checks passed
dataroaring pushed a commit that referenced this pull request Jun 26, 2024
…#36289)

#31442

Added Iceberg writer support so Doris can write data directly into the
lake:
1. Support `INSERT INTO` for Iceberg tables by appending HDFS files.
2. Implement Iceberg partition routing through partitionTransform:
2.1) The FE serializes the partition spec and schema into JSON; the BE
deserializes it to recover the Iceberg table's schema and partition
information.
2.2) The BE then applies Iceberg's Identity, Bucket, Year/Month/Day and
other partition strategies through partitionTransform and a template
class.
3. Manage transactions through IcebergTransaction:
3.1) After the BE finishes writing files, it reports CommitData to the
FE at partition granularity.
3.2) On receiving the CommitData, the FE commits the metadata to
Iceberg in IcebergTransaction.
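The partition transforms named in step 2.2 are fixed by the Iceberg table spec, so their behavior can be illustrated independently of the Doris BE code (which is C++). The following is a minimal Python sketch of the Identity, Year, Month, Day, and Bucket transforms; the function names are illustrative, not the PR's actual identifiers. Bucket uses the 32-bit Murmur3 hash of the value's 8-byte little-endian encoding, as the Iceberg spec requires for int/long values.

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def identity_transform(value):
    # identity: partition directly on the source value
    return value

def year_transform(d: date) -> int:
    # years since 1970
    return d.year - 1970

def month_transform(d: date) -> int:
    # months since 1970-01
    return (d.year - 1970) * 12 + (d.month - 1)

def day_transform(d: date) -> int:
    # days since 1970-01-01
    return (d - EPOCH).days

def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 hash, the hash Iceberg's bucket transform is defined on."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data) & ~3
    for i in range(0, n, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    k = 0
    tail = data[n:]
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def bucket_transform(num_buckets: int, value: int) -> int:
    # Iceberg hashes int/long values as an 8-byte little-endian long,
    # then takes (hash & Integer.MAX_VALUE) % N.
    h = murmur3_x86_32(value.to_bytes(8, "little", signed=True))
    return (h & 0x7FFFFFFF) % num_buckets
```

For example, the Iceberg spec's hash test vector gives hash(34) = 2017239379 for int/long values, so `bucket_transform(16, 34)` should yield 3, and `month_transform(date(2017, 11, 16))` yields 574.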

### Future work
- Add unit test for partition transform function.
- Implement partition transform function with exchange sink turned on.
- The partition transform function omits the processing of bigint type.

---------

Co-authored-by: lik40 <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
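The CommitData flow in step 3 can be sketched as follows. This is a hypothetical Python illustration (the real FE code is Java and commits through the Iceberg API); the class and field names here are made up for the sketch. Each BE reports one record per written file, tagged with its transformed partition tuple; the FE folds the reports in at partition granularity and commits everything in one transaction.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class CommitData:
    partition: tuple       # transformed partition values, e.g. (574, 3)
    file_path: str         # data file the BE wrote to HDFS
    record_count: int
    file_size_bytes: int

class IcebergTransactionSketch:
    def __init__(self):
        self._files_by_partition = defaultdict(list)

    def update_commit_data(self, commits):
        # FE side: accumulate BE reports, grouped by partition
        for c in commits:
            self._files_by_partition[c.partition].append(c)

    def commit(self):
        # A real implementation would add each data file to an Iceberg
        # append operation and commit the table transaction atomically;
        # here we just summarize what would be committed.
        total = sum(len(v) for v in self._files_by_partition.values())
        return {"partitions": len(self._files_by_partition), "files": total}
```

Because the FE commits only after all BEs have reported, partially written files are never visible to readers.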
kaka11chen pushed a commit to kaka11chen/doris that referenced this pull request Jul 12, 2024
…apache#36289)

apache#31442 (same description as above)
morningman added a commit that referenced this pull request Jul 13, 2024
…7692)

## Proposed changes

Cherry-pick the Iceberg partition transform functionality. #36289 #36889

---------

Co-authored-by: kang <[email protected]>
Co-authored-by: lik40 <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Mingyu Chen <[email protected]>
Labels
approved Indicates a PR has been approved by one committer. dev/2.1.5-merged dev/3.0.0-merged reviewed
6 participants