Replies: 4 comments 1 reply
-
I know you folks are the contributors to this repository and I shouldn't be tagging you directly; anyone can watch and reply, but sadly no one does. To be honest, we are still deciding on the factors before finalising on Pekko, so we need your input since you are already experts.
-
Yes: in a healthy cluster, you can be confident there will be at most one actor running for each entity id. As such, cluster sharding is indeed sufficient to ensure "a file being processed once at a given time across all the pods in the cluster".
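For illustration, a minimal sketch (Scala, Pekko typed API) of sharding with one entity per directory; the FileWorker actor and its message protocol are made up for this example:

```scala
import org.apache.pekko.actor.typed.{ActorSystem, Behavior}
import org.apache.pekko.actor.typed.scaladsl.Behaviors
import org.apache.pekko.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}

object FileWorker {
  sealed trait Command
  final case class ProcessFile(path: String) extends Command

  val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("FileWorker")

  def apply(directory: String): Behavior[Command] =
    Behaviors.receiveMessage { case ProcessFile(path) =>
      // Every file under `directory` is handled by this single actor,
      // so no two pods process the same directory concurrently.
      println(s"[$directory] processing $path")
      Behaviors.same
    }
}

object ShardingSetup {
  def init(system: ActorSystem[_]): Unit = {
    val sharding = ClusterSharding(system)
    // The entityId is the directory path: at most one live FileWorker
    // per directory, cluster-wide.
    sharding.init(Entity(FileWorker.TypeKey)(ctx => FileWorker(ctx.entityId)))

    // Messages are routed by entityId regardless of which pod sends them.
    sharding
      .entityRefFor(FileWorker.TypeKey, "/data/incoming")
      .!(FileWorker.ProcessFile("/data/incoming/part-0001.csv"))
  }
}
```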
As mentioned in apache/pekko-connectors#814, on its own cluster sharding is not sufficient to get exactly-once delivery: when the file processing is interrupted for some reason, you need some way to decide whether to restart or resume it. You don't need any additional locking for this, but making the upload idempotent would indeed help solve this aspect.
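For example, a minimal sketch of one way to make the upload idempotent: derive a deterministic destination key from the source path so a retry overwrites the same object instead of duplicating it. `objectExists` and `put` are hypothetical stand-ins for your object-store client:

```scala
import scala.concurrent.{ExecutionContext, Future}

object IdempotentUpload {
  def upload(sourcePath: String,
             objectExists: String => Future[Boolean],
             put: (String, String) => Future[Unit])
            (implicit ec: ExecutionContext): Future[Unit] = {
    // Deterministic key per file: a retried upload targets the same object.
    val destKey = sourcePath.stripPrefix("/")
    objectExists(destKey).flatMap {
      case true  => Future.unit               // already uploaded: safe to skip on restart/resume
      case false => put(sourcePath, destKey)
    }
  }
}
```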
Actors are typically cheap, so in that sense the number of actors will not be a performance bottleneck. However, if there are many files, I could imagine you'd overload your system by starting too many uploads in parallel. You could probably restrict the number of parallel uploads from your HDFS scanning code.
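For instance, a small Pekko Streams sketch that caps the number of in-flight uploads from the scanning side; `upload` is a hypothetical function doing the actual transfer:

```scala
import org.apache.pekko.Done
import org.apache.pekko.actor.typed.ActorSystem
import org.apache.pekko.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

object BoundedUploads {
  // Caps concurrent uploads at `parallelism`, however many files the scan found.
  def run(paths: List[String], upload: String => Future[Done], parallelism: Int = 4)
         (implicit system: ActorSystem[_]): Future[Done] =
    Source(paths)
      .mapAsync(parallelism)(upload) // back-pressures the scan to N concurrent uploads
      .runWith(Sink.ignore)
}
```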
-
What a superb response. I just completed my POC; let me share my findings.
You are correct, and my POC proved it. The entityId I chose was the HDFS / S3 directory that the schedulers were polling at a specified interval, and all the files in a given directory were processed in the same k8s pod. No explicit lock needed.
I am planning to have an object-store marker file per directory (entityId), with the status of all successfully processed files under that directory stored as metadata on the marker file. During every poll it should compare the delta; whichever files were already successfully processed will be ignored [checkpoint]. (Redis / DynamoDB would be ideal but we have constraints; the same goes for Pekko Persistence with relational DBs.)
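Roughly, the delta computation could look like this; `readMarker` / `writeMarker` are hypothetical stand-ins for reading and writing the marker file in the object store:

```scala
object DirectoryCheckpoint {
  // Only the files not yet recorded in the marker are processed this poll.
  def filesToProcess(listing: Set[String], readMarker: () => Set[String]): Set[String] =
    listing.diff(readMarker())

  // Append each file to the marker after its upload succeeds.
  def recordSuccess(file: String,
                    readMarker: () => Set[String],
                    writeMarker: Set[String] => Unit): Unit =
    writeMarker(readMarker() + file)
}
```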
As of now I am going with an actor per directory, with the entityId being the directory path, but I am curious to know how to restrict the number of parallel uploads from the HDFS scanning code (any suggestions, @raboof?)
-
Much appreciated, thank you for your response.
-
With Cluster Sharding, what is the likelihood of a file being processed exactly once at a given time across all the pods in the cluster?
Each entity in a sharded cluster is uniquely identified by its entityId, as far as the documentation mentions.
I have 2 doubts:
For each file within an HDFS directory, if we provide a unique entity ID by hashing the file path, will that ensure that at a given time the file is processed by exactly one actor in the cluster, or do we need a locking mechanism in place / need to implement the file-processing logic in such a way that it is idempotent? (A small sketch of this id derivation follows after the next point.)
If there are too many files in the source HDFS directory, a large number of actors will be created at a given time. Will that become a performance bottleneck?
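To make the first doubt concrete, a small sketch of the entity id derivations under discussion: one entity per file via a path hash, and one entity per directory. Names are illustrative only:

```scala
import java.security.MessageDigest

object EntityIds {
  // One entity per file: id = SHA-256 hash of the file path.
  def perFile(filePath: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(filePath.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString

  // One entity per directory: id = the directory path itself.
  def perDirectory(filePath: String): String =
    filePath.substring(0, filePath.lastIndexOf('/'))
}
```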
Considering the above requirements, is Cluster Sharding an appropriate technique to adopt for distributed file download? (A Cluster Singleton deployment would ensure exactly one pull per file to be processed, which is good, but the problem is it won't scale when directories and files increase, and idle pods in the cluster are another drawback, so we thought of exploring Cluster Sharding.)
Hoping the contributors can provide some insight. Grateful!