Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace deprecated statefulMapConcat #133

Merged
merged 8 commits into from
Feb 24, 2024

Conversation

pjfanning
Copy link
Contributor

@pjfanning pjfanning commented Feb 17, 2024

I'm not sure if this is better but statefulMapConcat is deprecated.

The code is trying the take a Source[String, _] and to remove duplicates.

Both approaches involve building sets and those sets will consume a lot of memory if there is a lot of data. I don't think this is avoidable.

I am not a Cassandra expert but I think we might be able to get Cassandra to run 'DISTINCT' on the query. That could mean that we could remove the statefulMapConcat/statefulMap stage.

@pjfanning pjfanning marked this pull request as draft February 17, 2024 14:07
@pjfanning pjfanning changed the title [DRAFT] replace deprecated statefulMapConcat replace deprecated statefulMapConcat Feb 22, 2024
@nvollmar
Copy link
Contributor

Cassandra only supports distinct queries on the partition key. Since both persistence_id and tag are the partition key we can't distinct query on the tag alone.

Since this is part of the reconciler that has to be manually called and is already marked as a rather expensive operation, I'd think this change is not too risky.

Copy link
Contributor

@nvollmar nvollmar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@pjfanning pjfanning marked this pull request as ready for review February 23, 2024 12:38
@pjfanning
Copy link
Contributor Author

Cassandra only supports distinct queries on the partition key. Since both persistence_id and tag are the partition key we can't distinct query on the tag alone.

Since this is part of the reconciler that has to be manually called and is already marked as a rather expensive operation, I'd think this change is not too risky.

@nvollmar I haven't tried it but if it was possible to sort the query result then the duplicate check would only need to know the last element as opposed to keeping a full set of visited tags. Do you think Cassandra is likely to allow this query to be sorted?

@nvollmar
Copy link
Contributor

@pjfanning Cassandra does not allow to sort by arbitrary columns. You can define a cluster ordering of a table, but that also has limitations. A more "Cassandra way" to solve this would be using a dedicated table to keep all unique tags for example.

Copy link
Member

@Roiocam Roiocam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only one style suggestion

…onciler/AllTags.scala

Co-authored-by: AndyChen(Jingzhang) <[email protected]>
@pjfanning pjfanning merged commit 90d80ad into apache:main Feb 24, 2024
13 checks passed
@pjfanning pjfanning deleted the replace-statefulmap branch February 24, 2024 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants