
Parallel evaluation of a CQL query on multiple Cassandra nodes in a cluster #2

Open
acharal opened this issue Jul 23, 2016 · 0 comments


acharal commented Jul 23, 2016

Motivation

It would be great to implement a parallel version of the Cassandra connector. Assume that the Semagrow execution engine spans a cluster of nodes, and that each node can execute part of the execution plan in parallel. Assume also that each Semagrow node is colocated with a Cassandra node. Then a single CQL query can be processed in parallel by all the colocated Cassandra and Semagrow nodes, with each pair doing its share of the work locally on its physical node.

Suggested Solution

An easy way to retrieve data local to a Cassandra node is to use the CQL token function. The spark-cassandra-connector uses the same technique (see, for example, CqlTokenRange and CassandraTableScanRDD). Each Cassandra node receives a modified CQL query with a token-range restriction added to the WHERE clause. For example, suppose that there are 3 nodes in the cluster and the initial CQL query is

SELECT event_description 
FROM events
WHERE event_category = 'Alerts'

The i-th node will then get a query of the form

SELECT event_description 
FROM events
WHERE token(event_name) >= x_i AND token(event_name) < y_i AND event_category = 'Alerts'

Ideally, the token range [x_i, y_i) matches the data stored locally on the i-th node, so no data is exchanged over the network. However, if not every Cassandra node participates in a Semagrow computation, some nodes will receive a query whose token range falls outside the range they own. In that case the Cassandra cluster handles the query by locating the nodes that own those tokens and transferring the matching rows to the node that handles the query.
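
To make the rewriting step concrete, here is a minimal Python sketch of how the per-node queries could be generated. It is only an illustration under several assumptions: the cluster uses the Murmur3Partitioner (tokens from -2^63 to 2^63 - 1), event_name is the partition key of events, and the ring is simply cut into n equal pieces. A real implementation would read the actual token ownership from the cluster metadata instead, so that each range lines up with a node's local data. The helper names split_token_ring and add_token_range are hypothetical and not part of Semagrow or any driver.

```python
# Assumption: Murmur3Partitioner, whose tokens span [-2**63, 2**63 - 1].
MIN_TOKEN = -2**63
RING_SIZE = 2**64


def split_token_ring(n):
    """Cut the ring into n contiguous half-open ranges [lo, hi).

    The last range has hi = None (no upper bound), so that the largest
    token, 2**63 - 1, is not lost to the strict '<' comparison.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    bounds = [MIN_TOKEN + (RING_SIZE * i) // n for i in range(n)]
    return [(bounds[i], bounds[i + 1] if i + 1 < n else None)
            for i in range(n)]


def add_token_range(select_from, where, partition_key, lo, hi):
    """Prepend a token-range restriction to an existing WHERE condition."""
    conditions = [f"token({partition_key}) >= {lo}"]
    if hi is not None:
        conditions.append(f"token({partition_key}) < {hi}")
    conditions.append(where)
    return f"{select_from} WHERE " + " AND ".join(conditions)


if __name__ == "__main__":
    # One query per node for a hypothetical 3-node cluster.
    for lo, hi in split_token_ring(3):
        print(add_token_range(
            "SELECT event_description FROM events",
            "event_category = 'Alerts'",
            "event_name",  # assumed partition key
            lo, hi))
```

Because the ranges are contiguous and together cover the whole ring, the union of the per-node results should equal the result of the original query, regardless of which node ends up serving which range.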

Hope that the suggestion is at least sound.
