-
Fluss is a very good streaming storage, but I encountered some questions during my research.

On-stream analytics: I loaded different volumes of data into Fluss and then used Flink SQL to compare queries. I found that once the data reached billions of rows, query speed dropped significantly, although point queries were still excellent.

CREATE TABLE orders (
order_id BIGINT PRIMARY KEY NOT ENFORCED,
user_id BIGINT,
product_id BIGINT,
order_amount DECIMAL(10, 2),
create_time TIMESTAMP_LTZ(3)
) WITH (
'bucket.num' = '4',
'table.datalake.enabled' = 'true'
);
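A rough sketch of the kind of comparison queries meant here, assuming the $lake suffix that Fluss documents for reading only the tiered Paimon data (the exact queries are illustrative, not the actual benchmark):

-- Union read: the plain table name combines the Paimon snapshot with the
-- rows that still live only in Fluss (requires 'table.datalake.enabled' = 'true').
SELECT user_id, SUM(order_amount) AS total_amount
FROM orders
GROUP BY user_id;

-- Lake-only read: the $lake suffix (assumed here) queries just the tiered
-- Paimon data and skips the Fluss incremental part.
SELECT user_id, SUM(order_amount) AS total_amount
FROM `orders$lake`
GROUP BY user_id;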
In particular, when the data reaches hundreds of millions of rows, a union read takes roughly twice as long as querying the Paimon table alone. My questions about all of the above are:
The storage cost: I compared the storage cost of the above orders table. Fluss takes up a lot of local disk.
-
Hi @JohnZp, thanks for the detailed testing! I will answer the questions below:
The incremental part is the changelog part from the lake table snapshot time until now. Log and changelog are stored on local disk (and tiered to remote storage if configured), so it is not stored in memory. The current implementation of union read is very basic, with many optimizations planned in future versions. Currently, for a simple count() of a union read on the incremental part, it needs to read all the incremental changelog data into the query engine. This is inefficient and can be optimized to …
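For illustration, a minimal sketch of the kind of query this applies to, using only the orders table defined earlier; the decomposition in the comments is just the conceptual split described above, not a Fluss API:

-- A union-read aggregation over the datalake-enabled orders table.
-- Conceptually: result = aggregate over the Paimon snapshot
--                      + aggregate over the Fluss changelog since that snapshot
-- (today the changelog part is shipped row by row to the query engine).
SELECT COUNT(*) FROM orders;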
We didn't plan a specific version for this feature, but contributions are very welcome at any time!
The disk cost looks strange. Could you share the cluster configuration of Fluss?
Yes, in the future, the Zero Disk Architecture can use remote storage to replace local disks and make TabletServers stateless.
It depends. The throughput limitation will change from "disk bandwidth" to "network bandwidth" and the throughput provided by the remote storage. And latency is expected to increase.
-
I'd like to add some comments about union read. It also depends on how many rows remain in Fluss that still need to be read: the more rows remain in Fluss, the slower the read usually is. What's more, if it's a primary key table, union read has to sort-merge the rows from Fluss and Paimon, which can be time-consuming. We haven't done much optimization on this part yet. But of course, we can...
-
@JohnZp thank you for the provided configuration and disk usage. I think there are some reasons. It seems the average log size is much larger than Kafka's. Three possible reasons (a rough disk estimate is sketched after this list):

1. All the old log segments (except the active segment) should have been tiered to remote storage. But by default, the local log needs to retain 2 segments (so each bucket consumes <= 2GB), see …
2. We don't support storage-compute separation or tiered storage for the kv store currently, so we have to store all kv data locally. Storage-compute separation is also on Fluss's roadmap, and there are many promising projects (e.g., ForstDB, SlateDB) to help achieve this goal.
3. In the future, we can optimize the lifecycle of the changelog; currently, it is retained for 7 days by default. But changelogs before a specific …
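As a rough back-of-the-envelope sketch against the orders table above (hedged: replication and index overhead are ignored, the 2-segments-per-bucket figure is taken from point 1, and 'table.log.ttl' is assumed to be the table option controlling log/changelog retention; please verify the exact key and format in the Fluss configuration docs):

-- Local disk footprint, per the defaults discussed above:
--   local log  <= bucket.num * 2 retained segments (<= 2GB per bucket) = 4 * 2GB = ~8GB
--   kv store    = full primary-key state, always stored locally for now
--   changelog   = retained for 7 days by default before it expires
-- Shortening log/changelog retention is one lever on local disk cost, e.g.:
CREATE TABLE orders (
order_id BIGINT PRIMARY KEY NOT ENFORCED,
user_id BIGINT,
product_id BIGINT,
order_amount DECIMAL(10, 2),
create_time TIMESTAMP_LTZ(3)
) WITH (
'bucket.num' = '4',
'table.datalake.enabled' = 'true',
'table.log.ttl' = '1d'  -- assumed option name and duration format; verify in the docs
);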