Replies: 4 comments 1 reply
-
I know you folks are the contributors to this repository and I shouldn't be tagging you directly; anyone can watch and reply, but sadly no one does. To be honest, we are still deciding on the factors before finalising on Pekko, so we need your input since you are already experts.
-
Yes: in a healthy cluster, you can be confident there will be at most one actor running for each entity id. As such, cluster sharding is indeed sufficient to ensure "a file being processed once at a given time across all the pods in the cluster".
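For illustration, a minimal sketch (Scala, Pekko typed API) of sharding with one entity per directory; the FileWorker actor and its message protocol are made up for this example:

```scala
import org.apache.pekko.actor.typed.{ActorSystem, Behavior}
import org.apache.pekko.actor.typed.scaladsl.Behaviors
import org.apache.pekko.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}

object FileWorker {
  sealed trait Command
  final case class ProcessFile(path: String) extends Command

  val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("FileWorker")

  def apply(directory: String): Behavior[Command] =
    Behaviors.receiveMessage { case ProcessFile(path) =>
      // Every file under `directory` is handled by this single actor,
      // so no two pods process the same directory concurrently.
      println(s"[$directory] processing $path")
      Behaviors.same
    }
}

object ShardingSetup {
  def init(system: ActorSystem[_]): Unit = {
    val sharding = ClusterSharding(system)
    // The entityId is the directory path: at most one live FileWorker
    // per directory, cluster-wide.
    sharding.init(Entity(FileWorker.TypeKey)(ctx => FileWorker(ctx.entityId)))

    // Messages are routed by entityId regardless of which pod sends them.
    sharding
      .entityRefFor(FileWorker.TypeKey, "/data/incoming")
      .!(FileWorker.ProcessFile("/data/incoming/part-0001.csv"))
  }
}
```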
As mentioned in apache/pekko-connectors#814, on its own cluster sharding is not sufficient to get exactly-once delivery: when the file processing is interrupted for some reason, you need some way to decide whether to restart or resume it. You don't need any additional locking for this, but making the upload idempotent would indeed help solve this aspect.
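For example, a minimal sketch of one way to make the upload idempotent: derive a deterministic destination key from the source path so a retry overwrites the same object instead of duplicating it. `objectExists` and `put` are hypothetical stand-ins for your object-store client:

```scala
import scala.concurrent.{ExecutionContext, Future}

object IdempotentUpload {
  def upload(sourcePath: String,
             objectExists: String => Future[Boolean],
             put: (String, String) => Future[Unit])
            (implicit ec: ExecutionContext): Future[Unit] = {
    // Deterministic key per file: a retried upload targets the same object.
    val destKey = sourcePath.stripPrefix("/")
    objectExists(destKey).flatMap {
      case true  => Future.unit               // already uploaded: safe to skip on restart/resume
      case false => put(sourcePath, destKey)
    }
  }
}
```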
Actors are typically cheap, so in that sense the number of actors will not be a performance bottleneck. However, if there are many files, I could imagine you'd overload your system by starting too many uploads in parallel. You could probably restrict the number of parallel uploads from your HDFS scanning code.
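For instance, a small Pekko Streams sketch that caps the number of in-flight uploads from the scanning side; `upload` is a hypothetical function doing the actual transfer:

```scala
import org.apache.pekko.Done
import org.apache.pekko.actor.typed.ActorSystem
import org.apache.pekko.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

object BoundedUploads {
  // Caps concurrent uploads at `parallelism`, however many files the scan found.
  def run(paths: List[String], upload: String => Future[Done], parallelism: Int = 4)
         (implicit system: ActorSystem[_]): Future[Done] =
    Source(paths)
      .mapAsync(parallelism)(upload) // back-pressures the scan to N concurrent uploads
      .runWith(Sink.ignore)
}
```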
-
What a superb response. I just completed my POC; let me share my findings.
You are correct, and my POC proved it. The entityId I chose was the HDFS / S3 directory that the schedulers were polling at a specified interval, and all the files in a given directory were processed in the same k8s pod. No explicit lock needed.
I am planning to have an object-store marker file per directory (entityId), with the status of all successfully processed files under that directory stored as metadata on the marker file. During every poll it should compare the delta; whichever files were already successfully processed will be ignored [checkpoint]. (Redis / DynamoDB would be ideal but we have constraints; the same goes for Pekko Persistence with relational DBs.)
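Roughly, the delta computation could look like this; `readMarker` / `writeMarker` are hypothetical stand-ins for reading and writing the marker file in the object store:

```scala
object DirectoryCheckpoint {
  // Only the files not yet recorded in the marker are processed this poll.
  def filesToProcess(listing: Set[String], readMarker: () => Set[String]): Set[String] =
    listing.diff(readMarker())

  // Append each file to the marker after its upload succeeds.
  def recordSuccess(file: String,
                    readMarker: () => Set[String],
                    writeMarker: Set[String] => Unit): Unit =
    writeMarker(readMarker() + file)
}
```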
As of now I am going with an actor per directory, with the entityId being the directory path, but I am curious to know how to restrict the number of parallel uploads from the HDFS scanning code (any suggestions, @raboof?)
-
Much appreciated, thank you for your response.
-
With Cluster Sharding, what is the likelihood of a file being processed exactly once at a given time across all the pods in the cluster?
Each entity in a sharded cluster is uniquely identified by its entityId, as far as the documentation mentions.
I have 2 doubts:
For each file within an HDFS directory, if we provide a unique entity ID by hashing the file path, will that ensure that at a given time the file is processed by exactly one actor in the cluster, or do we need a locking mechanism in place / need to implement the file-processing logic in such a way that it is idempotent? (A small sketch of this id derivation follows after the next point.)
If there are too many files in the source HDFS directory, a large number of actors will be created at a given time. Will that become a performance bottleneck?
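To make the first doubt concrete, a small sketch of the entity id derivations under discussion: one entity per file via a path hash, and one entity per directory. Names are illustrative only:

```scala
import java.security.MessageDigest

object EntityIds {
  // One entity per file: id = SHA-256 hash of the file path.
  def perFile(filePath: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(filePath.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString

  // One entity per directory: id = the directory path itself.
  def perDirectory(filePath: String): String =
    filePath.substring(0, filePath.lastIndexOf('/'))
}
```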
Considering the above requirements, is Cluster Sharding an appropriate technique to adopt for distributed file download? (A Cluster Singleton deployment would ensure exactly one pull per file to be processed, which is good, but the problem is it won't scale when directories and files increase, and idle pods in the cluster are another drawback, so we thought of exploring Cluster Sharding.)
Hoping the contributors can provide some insight. Grateful!